Lecture 5: Logistic Regression. Neural Networks

1 Lecture 5: Logistic Regression. Neural Networks
- Logistic regression
- Comparison with generative models
- Feed-forward neural networks
- Backpropagation
- Tricks for training neural networks
COMP-652, Lecture 5 - September 21,

2 Recall: Binary classification
- We are given a set of data $\langle x_i, y_i \rangle$ with $y_i \in \{0, 1\}$.
- The probability that a given input $x$ has class $y = 1$ can be computed as:
$$P(y=1 \mid x) = \frac{P(x, y=1)}{P(x)} = \frac{P(x \mid y=1)P(y=1)}{P(x \mid y=1)P(y=1) + P(x \mid y=0)P(y=0)} = \frac{1}{1 + \exp(-a)} = \sigma(a)$$
where
$$a = \ln \frac{P(x \mid y=1)P(y=1)}{P(x \mid y=0)P(y=0)}$$
- $\sigma$ is the sigmoid function (also called the "squashing" function)
- $a$ is the log-odds of the data being class 1 vs. class 0
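As a quick numerical check of this identity, the sketch below computes the posterior once with Bayes' rule and once as the sigmoid of the log-odds. The prior and likelihood values are made-up numbers chosen for illustration; they do not come from the lecture.

```python
import math

def sigmoid(a):
    """Logistic (squashing) function."""
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical values for one fixed input x (illustrative only)
p_x_given_1, p_1 = 0.30, 0.40   # P(x | y=1), P(y=1)
p_x_given_0, p_0 = 0.10, 0.60   # P(x | y=0), P(y=0)

# Posterior via Bayes' rule
posterior_bayes = (p_x_given_1 * p_1) / (p_x_given_1 * p_1 + p_x_given_0 * p_0)

# Posterior via the sigmoid of the log-odds a
a = math.log((p_x_given_1 * p_1) / (p_x_given_0 * p_0))
posterior_sigmoid = sigmoid(a)

print(posterior_bayes, posterior_sigmoid)
```

Both routes give the same number, which is why a model of the log-odds alone is enough for classification.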

3 Recall: Modelling for binary classification
$$P(y=1 \mid x) = \sigma\left( \ln \frac{P(x \mid y=1)P(y=1)}{P(x \mid y=0)P(y=0)} \right)$$
- One approach is to model $P(y)$ and $P(x \mid y)$, then use the formula above for classification (naive Bayes, Gaussian discriminants)
- This is called generative learning, because we can actually use the model to generate (i.e. fantasize) data
- Another idea is to model $P(y \mid x)$ directly
- This is called discriminative learning, because we only care about discriminating (i.e. separating) examples of the two classes
- We focus on this approach today

4 Error function
- Maximum likelihood classification assumes that we will find the hypothesis that maximizes the (log) likelihood of the training data:
$$\arg\max_h \log P(\langle x_i, y_i \rangle_{i=1 \ldots m} \mid h) = \arg\max_h \sum_{i=1}^{m} \log P(\langle x_i, y_i \rangle \mid h)$$
(using the usual i.i.d. assumption)
- But if we use discriminative learning, we have no model of the input distribution
- Instead, we want to maximize the conditional probability of the labels, given the inputs:
$$\arg\max_h \sum_{i=1}^{m} \log P(y_i \mid x_i, h)$$

5 The cross-entropy error function
- Suppose we interpret the output of the hypothesis, $h(x_i)$, as the probability that $y_i = 1$
- Then the log-likelihood of a hypothesis $h$ is:
$$\log L(h) = \sum_{i=1}^{m} \log P(y_i \mid x_i, h) = \sum_{i=1}^{m} \begin{cases} \log h(x_i) & \text{if } y_i = 1 \\ \log(1 - h(x_i)) & \text{if } y_i = 0 \end{cases} = \sum_{i=1}^{m} \left[ y_i \log h(x_i) + (1 - y_i) \log(1 - h(x_i)) \right]$$
- The cross-entropy error function is the negative of this quantity:
$$J(h) = -\left( \sum_{i=1}^{m} y_i \log h(x_i) + (1 - y_i) \log(1 - h(x_i)) \right)$$
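A minimal sketch of the cross-entropy error in NumPy follows. The labels and predicted probabilities are invented for illustration, and the function assumes every prediction lies strictly between 0 and 1 (as sigmoid outputs do), so the logarithms stay finite.

```python
import numpy as np

def cross_entropy(y, h):
    """J(h) = -sum_i [ y_i log h_i + (1 - y_i) log(1 - h_i) ]."""
    y = np.asarray(y, dtype=float)
    h = np.asarray(h, dtype=float)
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

y = [1, 0, 1]
confident = cross_entropy(y, [0.9, 0.1, 0.8])   # predictions close to the labels
hesitant = cross_entropy(y, [0.6, 0.4, 0.5])    # predictions close to 0.5
print(confident, hesitant)
```

Predictions that put more probability on the correct labels get a lower error, which is what minimizing $J$ rewards.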

6 Logistic regression
- Suppose we represent the hypothesis itself as a logistic function of a linear combination of inputs:
$$h(x) = \frac{1}{1 + \exp(-w^T x)}$$
This is also known as a sigmoid neuron
- Suppose we interpret $h(x)$ as $P(y = 1 \mid x)$
- Then the log-odds ratio is
$$\ln\left( \frac{P(y=1 \mid x)}{P(y=0 \mid x)} \right) = w^T x$$
which is linear (nice!)
- We can find the optimum weights by minimizing cross-entropy (maximizing the conditional likelihood of the data)
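The linearity of the log-odds follows in two lines from the definition of the logistic hypothesis. The derivation below fills in that step, writing $a = w^T x$ for brevity (the shorthand $a$ matches the log-odds symbol used on the earlier slide):

```latex
\begin{aligned}
1 - h(x) &= 1 - \frac{1}{1 + e^{-a}} = \frac{e^{-a}}{1 + e^{-a}}, \\
\ln \frac{h(x)}{1 - h(x)} &= \ln \frac{1}{e^{-a}} = a = w^T x .
\end{aligned}
```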

7 Cross-entropy error surface for logistic function
$$J(h) = -\left( \sum_{i=1}^{m} y_i \log \sigma(w^T x_i) + (1 - y_i) \log(1 - \sigma(w^T x_i)) \right)$$
- Nice error surface, unique minimum, but cannot solve in closed form

8 Maximization procedure: Gradient ascent
- First we compute the gradient of $\log L(w)$ with respect to $w$:
$$\begin{aligned} \nabla \log L(w) &= \sum_i \left[ y_i \frac{1}{h_w(x_i)} h_w(x_i)(1 - h_w(x_i)) x_i + (1 - y_i) \frac{1}{1 - h_w(x_i)} h_w(x_i)(1 - h_w(x_i)) x_i (-1) \right] \\ &= \sum_i x_i \left( y_i - y_i h_w(x_i) - h_w(x_i) + y_i h_w(x_i) \right) \\ &= \sum_i (y_i - h_w(x_i)) x_i \end{aligned}$$
- The update rule (because we maximize) is:
$$w \leftarrow w + \alpha \nabla \log L(w) = w + \alpha \sum_{i=1}^{m} (y_i - h_w(x_i)) x_i$$
where $\alpha \in (0, 1)$ is a step-size or learning rate parameter
- This is called logistic regression
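The update rule can be sketched in a few lines of NumPy. This is an illustrative batch version under stated assumptions: the toy data, the step size 0.1, and the fixed count of 100 steps are all invented for the example, and the first column of `X` is the constant input 1 so that the bias is part of `w`.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_likelihood(w, X, y):
    h = sigmoid(X @ w)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def fit_logistic(X, y, alpha=0.1, n_steps=100):
    """Batch gradient ascent: w <- w + alpha * sum_i (y_i - h_w(x_i)) x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        h = sigmoid(X @ w)
        w = w + alpha * (X.T @ (y - h))
    return w

# Toy 1-D data (made up); the first column of X is the constant input 1.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = fit_logistic(X, y)
print(w, sigmoid(X @ w))
```

On this toy set the log-likelihood rises from its value at `w = 0` and the rounded predictions match the labels.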

9 Another algorithm for optimization
- Recall Newton's method for finding the zero of a function $g : \mathbb{R} \to \mathbb{R}$
- At point $w^i$, approximate the function by a straight line (its tangent)
- Solve the linear equation for where the tangent equals 0, and move the parameter to this point:
$$w^{i+1} = w^i - \frac{g(w^i)}{g'(w^i)}$$

10 Application to machine learning
- Suppose for simplicity that the error function $J$ has only one parameter
- We want to optimize $J$, so we can apply Newton's method to find the zeros of $J' = \frac{d}{dw} J$
- We obtain the iteration:
$$w^{i+1} = w^i - \frac{J'(w^i)}{J''(w^i)}$$
- Note that there is no step size parameter!
- This is a second-order method, because it requires computing the second derivative
- But, if our error function is quadratic, this will find the global optimum in one step!
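The one-step claim for quadratic errors is easy to check numerically. The sketch below uses a hypothetical quadratic $J(w) = (w - 3)^2 + 1$, chosen only for illustration, and applies a single Newton update from an arbitrary starting point.

```python
def newton_step(w, d1, d2):
    """One Newton update w - J'(w) / J''(w) for a one-parameter error J."""
    return w - d1(w) / d2(w)

# Hypothetical quadratic error J(w) = (w - 3)^2 + 1, minimized at w = 3.
def dJ(w):
    return 2.0 * (w - 3.0)    # J'(w)

def d2J(w):
    return 2.0                # J''(w)

w_next = newton_step(10.0, dJ, d2J)
print(w_next)  # 3.0
```

For a non-quadratic error the same update is simply repeated until the derivative is close to zero.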

11 Second-order methods: Multivariate setting
- If we have an error function $J$ that depends on many variables, we can compute the Hessian matrix, which contains the second-order derivatives of $J$:
$$H_{ij} = \frac{\partial^2 J}{\partial w_i \partial w_j}$$
- The inverse of the Hessian gives the optimal learning rates
- The weights are updated as:
$$w \leftarrow w - H^{-1} \nabla_w J$$
- This is also called the Newton-Raphson method

12 Which method is better?
- Newton's method usually requires significantly fewer iterations than gradient descent
- Computing the Hessian requires a batch of data, so there is no natural on-line algorithm
- Inverting the Hessian explicitly is expensive, but almost never necessary
- Computing the product of a Hessian with a vector can be done in linear time (Schraudolph, 1994)

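13 Newton-Raphson for logistic regression
- Leads to a nice algorithm called iteratively reweighted least squares
- The Hessian has the form:
$$H = \Phi^T R \Phi$$
where $\Phi$ is the matrix whose rows are the input vectors $x_i$ and $R$ is the diagonal matrix of $h(x_i)(1 - h(x_i))$
- The weight update becomes:
$$w \leftarrow (\Phi^T R \Phi)^{-1} \Phi^T R \left( \Phi w - R^{-1}(h - y) \right)$$
where $h$ is the vector of predictions $h(x_i) = \sigma(w^T x_i)$ and $y$ is the vector of labels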
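A minimal sketch of this Newton-Raphson update follows, written as a weighted least-squares solve. The data are invented and deliberately not linearly separable, so that a finite optimum exists; the fixed count of 20 iterations is likewise an assumption, and the vector `h` holds the current predictions $\sigma(\Phi w)$.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def newton_logistic(Phi, y, n_iters=20):
    """Newton-Raphson for logistic regression as reweighted least squares."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        h = sigmoid(Phi @ w)
        r = h * (1.0 - h)                    # diagonal of R
        z = Phi @ w - (h - y) / r            # Phi w - R^{-1}(h - y)
        H = Phi.T @ (r[:, None] * Phi)       # Hessian Phi^T R Phi
        w = np.linalg.solve(H, Phi.T @ (r * z))
    return w

# Made-up, non-separable 1-D data; the first column is the constant input 1.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
Phi = np.column_stack([np.ones_like(x), x])

w = newton_logistic(Phi, y)
grad = Phi.T @ (y - sigmoid(Phi @ w))        # gradient of the log-likelihood
print(w, grad)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse Hessian explicitly. At convergence the gradient of the log-likelihood is numerically zero.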

14 Logistic vs. Gaussian Discriminant
- Comprehensive study done by Ng and Jordan (2002)
- If the Gaussian assumption is correct, as expected, Gaussian discriminant is better
- If the assumption is violated, Gaussian discriminant suffers from bias (unlike logistic regression)
- In practice, Gaussian discriminant tends to converge more quickly, but to a less helpful solution
- Note that they optimize different error functions!

15 From neurons to networks
- Logistic regression can be used to learn linearly separable problems
[Figure: panels (a) and (b), examples plotted on axes x_1 and x_2]
- If the instances are not linearly separable, the function cannot be learned correctly (e.g. XOR)
- One solution is to provide fixed, non-linear basis functions instead of inputs (like we did with linear regression)
- Today: learn such non-linear functions from data

16 Example: Logical functions of two variables
[Figure: left panel "Sigmoid Unit for And Function", mean square error vs. number of epochs; right panel, squared error vs. number of epochs; legend entries: mse-curve-0.1, mse-curve-0.01, mse-curve, xor-no-hidden]
- One sigmoid neuron can learn the AND function (left) but not the XOR function (right)
- In order to learn discrimination in data sets that are not linearly separable, we need networks of sigmoid units

17 Example: A network representing the XOR function
[Figure: network diagram with Input 1, Input 2, hidden units 3 and 4, and output unit 5. Weights shown: w30 = -3.06, w31 = 6.92, w32 = 6.91; w40 = -6.8, w41 = 4.5, w42 = 4.36; w50 = -4.86, w51 = 10.28, w52 = -10.34]
[Figure: "Learning Curve for XOR Function with Architecture", squared error vs. number of epochs; legend entry: mse-curve]
[Table with columns Input1, Input2, o3, o4, Output; values not recoverable]
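The weights printed on this slide can be plugged into a small forward computation to see that they do produce XOR. The wiring is an assumed reading of the diagram: `w_j0` is taken to be the bias of unit j, `w31`/`w32` and `w41`/`w42` connect the two inputs to hidden units 3 and 4, and `w51`/`w52` connect hidden units 3 and 4 to output unit 5. The numeric values are copied from the transcript as-is.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def xor_net(x1, x2):
    # Weights as printed on the slide; the wiring is an assumed reading.
    o3 = sigmoid(-3.06 + 6.92 * x1 + 6.91 * x2)      # hidden unit 3 (acts like OR)
    o4 = sigmoid(-6.8 + 4.5 * x1 + 4.36 * x2)        # hidden unit 4 (acts like AND)
    return sigmoid(-4.86 + 10.28 * o3 - 10.34 * o4)  # output unit 5

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(xor_net(x1, x2), 3))
```

Under this reading the output is close to 0 for equal inputs and close to 1 for unequal inputs: unit 3 fires when at least one input is on, and unit 4 vetoes the case where both are on.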

18 Feed-forward neural networks
[Figure: network diagram with an input layer (Input 1 ... Input n), a hidden layer, and an output layer; weight W_ji on the connection from unit i to unit j, with O_i = X_ji]
- A collection of units (neurons) with sigmoid or sigmoid-like activations, arranged in layers
- Layer 0 is the input layer, and its units just copy the inputs (by convention)
- The last layer, K, is called the output layer, since its units provide the result of the network
- Layers 1, ..., K - 1 are usually called hidden layers (their presence cannot be detected from outside the network)

19 Why this name?
- In feed-forward networks the outputs of units in layer k become inputs for units in layers with index > k
- There are no cross-connections between units in the same layer
- There are no backward ("recurrent") connections from layers downstream
- Typically, units in layer k provide input only to units in layer k + 1
- In fully connected networks, all units in layer k are connected to all units in layer k + 1

20 Notation
[Figure: network diagram with an input layer (Input 1 ... Input n), a hidden layer, and an output layer; weight W_ji on the connection from unit i to unit j, with O_i = X_ji]
- $w_{j,i}$ is the weight on the connection from unit $i$ to unit $j$
- By convention, $x_{j,0} = 1, \forall j$
- The output of unit $j$, denoted $o_j$, is computed using a sigmoid:
$$o_j = \sigma(w_j^T x_j)$$
where $w_j$ is the vector of weights on the connections entering unit $j$ and $x_j$ is the vector of inputs to unit $j$
- By the definition of the connections, $x_{j,i} = o_i$

21 Computing the output of the network
- Suppose that we want the network to make a prediction for instance $\langle x, y \rangle$
- In a feed-forward network, this can be done in one forward pass:
For layer k = 1 to K:
1. Compute the output of all neurons in layer k: $o_j \leftarrow \sigma(w_j^T x_j), \forall j \in \text{Layer } k$
2. Copy these outputs as inputs to the next layer: $x_{j,i} \leftarrow o_i, \forall i \in \text{Layer } k, \forall j \in \text{Layer } k + 1$
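The forward pass can be sketched for a fully connected network as a loop over weight matrices. The representation is an assumption made for the example: `weights[k-1]` holds the weights of layer k with one row per unit, and column 0 holds the weight on the constant input $x_{j,0} = 1$.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_pass(weights, x):
    """Return the outputs of the last layer for input vector x."""
    o = np.asarray(x, dtype=float)           # layer 0 just copies the inputs
    for W in weights:                        # layers k = 1 .. K
        x_in = np.concatenate(([1.0], o))    # x_{j,0} = 1 and x_{j,i} = o_i
        o = sigmoid(W @ x_in)                # o_j = sigma(w_j^T x_j)
    return o

# Hypothetical 2-3-1 network with all-zero weights: every unit outputs 0.5.
weights = [np.zeros((3, 3)), np.zeros((1, 4))]
output = forward_pass(weights, [0.2, -1.0])
print(output)
```

Each matrix-vector product computes all units of one layer at once, which is the vectorized form of step 1; reusing `o` as the next layer's input is step 2.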

22 Expressiveness of feed-forward neural networks
- A single sigmoid neuron has the same representational power as a perceptron: Boolean AND, OR, NOT, but not XOR
- Every Boolean function can be represented by a network with a single hidden layer, but it might require a number of hidden units that is exponential in the number of inputs
- Every bounded continuous function can be approximated with arbitrary precision by a network with one sufficiently large hidden layer [Cybenko 1989; Hornik et al. 1989]
- Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

23 Learning in feed-forward neural networks
- Usually, the network structure (units and interconnections) is specified by the designer
- The learning problem is finding a good set of weights
- The answer: gradient descent, because the hypothesis formed by the network, $h_w$, is:
  - Differentiable! Because of the choice of sigmoid units
  - Very complex! Hence, direct computation of the optimal weights is not possible

24 Backpropagation algorithm
- Just gradient descent over all weights in the network!
- We put together the two phases described above:
1. Forward pass: Compute the outputs of all units in the network, $o_k$, $k = N + 1, \ldots, N + H + 1$, going in increasing order of the layers
2. Backward pass: Compute the $\delta_k$ updates described before, going from $k = N + H + 1$ down to $k = N + 1$ (in decreasing order of the layers)
3. Update all the weights in the network: $w_{i,j} \leftarrow w_{i,j} + \alpha_{i,j} \delta_i x_{i,j}$
- For computing probabilities (cross-entropy error instead of squared error), drop the $o(1 - o)$ factor from the output unit's $\delta$

25 Backpropagation algorithm
1. Initialize all weights to small random numbers.
2. Repeat until satisfied:
(a) Pick a training example
(b) Input the example to the network and compute the outputs $o_k$
(c) For each output unit $k$: $\delta_k \leftarrow o_k (1 - o_k)(y_k - o_k)$
(d) For each hidden unit $l$: $\delta_l \leftarrow o_l (1 - o_l) \sum_{k \in \text{outputs}} w_{lk} \delta_k$
(e) Update each network weight: $w_{ij} \leftarrow w_{ij} + \alpha_{ij} \delta_j x_{ij}$
- $x_{ij}$ is the input from unit $i$ into unit $j$ (for the output neurons, these are signals received from the hidden layer neurons)
- $\alpha_{ij}$ is the learning rate or step size
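Steps (b) to (e) can be sketched for the smallest interesting case: one hidden layer and a single output unit, trained on squared error. Everything concrete here is an assumption for illustration: the 2-2-1 architecture, the random initial weights, the single shared learning rate `alpha`, and the array layout (one row of weights per unit, with column 0 holding the bias weight).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(W_hid, w_out, x):
    """Forward pass through one hidden layer and one output unit."""
    x_in = np.concatenate(([1.0], x))                      # constant input 1 for the bias
    h_in = np.concatenate(([1.0], sigmoid(W_hid @ x_in)))  # hidden outputs plus bias input
    return x_in, h_in, sigmoid(w_out @ h_in)

def backprop_step(W_hid, w_out, x, y, alpha=0.1):
    """One incremental update, steps (b)-(e), with a single learning rate."""
    x_in, h_in, o = forward(W_hid, w_out, x)                 # step (b)
    delta_out = o * (1 - o) * (y - o)                        # step (c)
    o_hid = h_in[1:]
    delta_hid = o_hid * (1 - o_hid) * w_out[1:] * delta_out  # step (d)
    w_out_new = w_out + alpha * delta_out * h_in             # step (e), output weights
    W_hid_new = W_hid + alpha * np.outer(delta_hid, x_in)    # step (e), hidden weights
    return W_hid_new, w_out_new

def squared_error(W_hid, w_out, x, y):
    return 0.5 * (y - forward(W_hid, w_out, x)[2]) ** 2

rng = np.random.default_rng(0)
W_hid = rng.normal(scale=0.5, size=(2, 3))   # 2 hidden units, each with bias + 2 inputs
w_out = rng.normal(scale=0.5, size=3)        # 1 output unit with bias + 2 hidden inputs
x, y = np.array([1.0, 0.0]), 1.0

before = squared_error(W_hid, w_out, x, y)
W_hid, w_out = backprop_step(W_hid, w_out, x, y)
after = squared_error(W_hid, w_out, x, y)
print(before, after)
```

The hidden deltas are computed from the output weights before those weights are updated, so the whole step uses one consistent gradient. A single small step on one example should lower the squared error on that example.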

26 Backpropagation variations
- The previous version corresponds to incremental (stochastic) gradient descent
- An analogous batch version can be used as well:
  - Loop through the training data, accumulating weight changes
  - Update weights
- One pass through the data set is called an epoch
- The algorithm can be easily generalized to predict probabilities, instead of minimizing sum-squared error
- Generalization is easy for other network structures as well

27 Convergence of backpropagation
- Backpropagation performs gradient descent over all the parameters in the network
- Hence, if the learning rate is appropriate, the algorithm is guaranteed to converge to a local minimum of the cost function
  - NOT the global minimum
  - Can be much worse than the global minimum
  - There can be MANY local minima (Auer et al, 1997)
- Partial solution: restarting = train multiple nets with different initial weights
- In practice, quite often the solution found is very good

28 Adding momentum
- On the t-th training sample, instead of the update:
$$\Delta w_{ij} \leftarrow \alpha_{ij} \delta_j x_{ij}$$
we do:
$$\Delta w_{ij}(t) \leftarrow \alpha_{ij} \delta_j x_{ij} + \beta \Delta w_{ij}(t - 1)$$
- The second term is called momentum
- Advantages:
  - Easy to pass small local minima
  - Keeps the weights moving in areas where the error is flat
  - Increases the speed where the gradient stays unchanged
- Disadvantages:
  - With too much momentum, it can get out of a global minimum!
  - One more parameter to tune, and more chances of divergence
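The speed-up where the gradient stays unchanged can be seen in a tiny sketch. It assumes a hypothetical situation in which the plain update $\alpha_{ij} \delta_j x_{ij}$ equals 0.1 on every sample, with $\beta = 0.9$; both numbers are arbitrary.

```python
def momentum_update(step, prev_delta, beta):
    """Delta w(t) = step + beta * Delta w(t - 1), where step = alpha * delta_j * x_ij."""
    return step + beta * prev_delta

# Hypothetical setting: the plain update stays at 0.1 for many samples in a row.
delta_w = 0.0
for _ in range(200):
    delta_w = momentum_update(0.1, delta_w, beta=0.9)
print(delta_w)  # approaches 0.1 / (1 - 0.9) = 1.0
```

With a constant gradient the effective step grows by a factor of $1 / (1 - \beta)$, which also shows why a large $\beta$ can overshoot.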

30 Choosing the learning rate
- Backprop is VERY sensitive to the choice of learning rate
  - Too large: divergence
  - Too small: VERY slow learning
- The learning rate also influences the ability to escape local optima
- Very often, different learning rates are used for units in different layers
- It can be shown that each unit has its own optimal learning rate

31 Adjusting the learning rate: Delta-bar-delta
- Heuristic method, works best in batch mode (though there have been attempts to make it incremental)
- The intuition:
  - If the gradient direction is stable, the learning rate should be increased
  - If the gradient flips to the opposite direction, the learning rate should be decreased
- A running average of the gradient and a separate learning rate are kept for each weight
- If the new gradient and the old average have the same sign, increase the learning rate by a constant amount
- If they have opposite signs, decay the learning rate exponentially
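The rule described on this slide can be sketched as a per-weight update. The parameter names `kappa` (additive increase), `phi` (multiplicative decay), and `theta` (averaging factor), their default values, and the choice of an exponential running average are assumptions made for the example; the slide only states the qualitative rule.

```python
import numpy as np

def delta_bar_delta(alpha, grad, avg_grad, kappa=0.01, phi=0.5, theta=0.7):
    """Adapt one learning rate per weight from the sign agreement of gradients."""
    agreement = grad * avg_grad
    # Same sign as the old average: increase by a constant amount.
    alpha = np.where(agreement > 0, alpha + kappa, alpha)
    # Opposite sign: decay multiplicatively (exponential decay over repeated flips).
    alpha = np.where(agreement < 0, alpha * (1.0 - phi), alpha)
    # Update the running average of the gradient.
    avg_grad = (1.0 - theta) * grad + theta * avg_grad
    return alpha, avg_grad

alpha = np.array([0.1, 0.1, 0.1])
grad = np.array([1.0, -1.0, 0.5])       # new gradients for three weights
avg_grad = np.array([0.5, 0.5, 0.0])    # old running averages
new_alpha, new_avg = delta_bar_delta(alpha, grad, avg_grad)
print(new_alpha, new_avg)
```

In this example the first weight's rate grows, the second is halved, and the third is unchanged because its old average is zero.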

32 How large should the network be?
- Overfitting occurs if there are too many parameters compared to the amount of data available
- Choosing the number of hidden units:
  - Too few hidden units do not allow the concept to be learned
  - Too many lead to slow learning and overfitting
  - If the n inputs are binary, log n is a good heuristic choice
- Choosing the number of layers:
  - Always start with one hidden layer
  - Never go beyond 2 hidden layers, unless the task structure suggests something different

33 Overtraining in feed-forward networks
[Figure: "Error versus weight updates (example 1)", training set error and validation set error vs. number of weight updates]
- Traditional overfitting is concerned with the number of parameters vs. the number of instances
- In neural networks there is an additional phenomenon called overtraining, which occurs when weights take on large magnitudes, pushing the sigmoids into saturation
- Effectively, as learning progresses, the network has more actual parameters
- Use a validation set to decide when to stop training!
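One common way to act on the validation-set advice is early stopping with a patience window, sketched below. The `patience` parameter and the error values are assumptions for illustration; the slide itself only says to use a validation set to decide when to stop.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return the epoch with the lowest validation error seen before stopping.

    Training stops once `patience` epochs pass without a new best error.
    """
    best_err = float("inf")
    best_epoch = -1
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break   # no improvement for `patience` epochs: stop training
    return best_epoch

# Made-up validation errors: they fall, then rise as the network overtrains.
val_errors = [1.0, 0.8, 0.6, 0.65, 0.7, 0.72, 0.5]
best = early_stopping_epoch(val_errors)
print(best)  # 2
```

Training halts at epoch 5, so the later value 0.5 is never seen; the weights saved at epoch 2 would be the ones kept.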

34 Finding the right network structure
- Destructive methods start with a large network and then remove (prune) connections
- Constructive methods start with a small network (e.g. 1 hidden unit) and add units as required to reduce error

35 Destructive methods
- Simple solution: consider removing each weight in turn (by setting it to 0), and examine the effect on the error
- Weight decay: give each weight a chance to go to 0, unless it is needed to decrease error:
$$\Delta w_j = -\alpha_j \frac{\partial J}{\partial w_j} - \lambda w_j$$
where $\lambda$ is a decay rate
- Optimal brain damage:
  - Train the network to a local optimum
  - Approximate the saliency of each link or unit (i.e., its impact on the performance of the network), using the Hessian matrix
  - Greedily prune the element with the lowest saliency
  - This loop is repeated until the error starts to deteriorate
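The weight decay update can be sketched directly from its formula. The learning rate, decay rate, and weight values below are arbitrary illustrative numbers, and a single shared `alpha` stands in for the per-weight $\alpha_j$.

```python
import numpy as np

def weight_decay_step(w, grad_J, alpha=0.1, lam=0.01):
    """Return w + Delta w, with Delta w_j = -alpha * dJ/dw_j - lam * w_j."""
    return w - alpha * grad_J - lam * w

# A weight that the error does not depend on (zero gradient) shrinks toward 0.
w = np.array([2.0, -1.0])
w_after = w.copy()
for _ in range(100):
    w_after = weight_decay_step(w_after, grad_J=np.zeros(2))
print(w_after)
```

With a zero gradient each step multiplies the weight by $1 - \lambda$, so unneeded weights fade out geometrically while weights with a nonzero error gradient are held in place by the first term.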

36 Constructive methods
- Dynamic node creation (Ash):
  - Start with just one hidden unit, train using backprop
  - If the error is still high, add another unit in the same layer and repeat
  - Only one layer is constructed
- Meiosis networks (Hanson):
  - Start with just one hidden unit, train using backprop
  - Compute the variance of each weight during training
  - If a unit has one or more weights of high variance, it is split into two units, and the weights are perturbed
  - The intuition is that the new units will specialize to different functions
- Cascade correlation: Add units to correlate with the residual error

37 Example applications
- Speech phoneme recognition [Waibel] and synthesis [NETtalk]
- Image classification [Kanade, Baluja, Rowley]
- Digit recognition [Bengio, Bottou, LeCun et al - LeNet]
- Financial prediction
- Learning control [Pomerleau et al]

38 When to consider using neural networks
- Input is high-dimensional discrete or real-valued (e.g. raw sensor input)
- Output is discrete or real-valued, or a vector of values
- Possibly noisy data
- Training time is not important
- Form of target function is unknown
- Human readability of result is not important
- The computation of the output based on the input has to be fast


A summary of Deep Learning without Poor Local Minima

A summary of Deep Learning without Poor Local Minima by Kenji Kawaguchi MIT oral presentation at NIPS 2016 Learning Supervised (or Predictive) learning Learn a mapping from inputs x to outputs y, given

ECE521 Lectures 9 Fully Connected Neural Networks

ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance

Artificial Neural Networks 2

CSC2515 Machine Learning Sam Roweis Artificial Neural s 2 We saw neural nets for classification. Same idea for regression. ANNs are just adaptive basis regression machines of the form: y k = j w kj σ(b

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#\$

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Jan Drchal Czech Technical University in Prague Faculty of Electrical Engineering Department of Computer Science Topics covered

CSCI567 Machine Learning (Fall 2018)

CSCI567 Machine Learning (Fall 2018) Prof. Haipeng Luo U of Southern California Sep 12, 2018 September 12, 2018 1 / 49 Administration GitHub repos are setup (ask TA Chi Zhang for any issues) HW 1 is due

Machine Learning. Linear Models. Fabio Vandin October 10, 2017

Machine Learning Linear Models Fabio Vandin October 10, 2017 1 Linear Predictors and Affine Functions Consider X = R d Affine functions: L d = {h w,b : w R d, b R} where ( d ) h w,b (x) = w, x + b = w

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Linear Classification CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Example of Linear Classification Red points: patterns belonging

Midterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian

Linear discriminant functions

Andrea Passerini passerini@disi.unitn.it Machine Learning Discriminative learning Discriminative vs generative Generative learning assumes knowledge of the distribution governing the data Discriminative

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

Neural Networks with Applications to Vision and Language Feedforward Networks Marco Kuhlmann Feedforward networks Linear separability x 2 x 2 0 1 0 1 0 0 x 1 1 0 x 1 linearly separable not linearly separable

ECE521 Lecture7. Logistic Regression

ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard

Machine Learning Lecture 5

Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

Neural Networks and the Back-propagation Algorithm

Neural Networks and the Back-propagation Algorithm Francisco S. Melo In these notes, we provide a brief overview of the main concepts concerning neural networks and the back-propagation algorithm. We closely

Computational statistics

Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial

Multilayer Perceptron

Aprendizagem Automática Multilayer Perceptron Ludwig Krippahl Aprendizagem Automática Summary Perceptron and linear discrimination Multilayer Perceptron, nonlinear discrimination Backpropagation and training

Logistic Regression & Neural Networks

Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein Logistic Regression Perceptron & Probabilities What if we want a probability

Machine Learning. Neural Networks

Machine Learning Neural Networks Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 Biological Analogy Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 THE

Introduction to Neural Networks

CUONG TUAN NGUYEN SEIJI HOTTA MASAKI NAKAGAWA Tokyo University of Agriculture and Technology Copyright by Nguyen, Hotta and Nakagawa 1 Pattern classification Which category of an input? Example: Character

The exam is closed book, closed notes except your one-page cheat sheet.

CS 189 Fall 2015 Introduction to Machine Learning Final Please do not turn over the page before you are instructed to do so. You have 2 hours and 50 minutes. Please write your initials on the top-right

Machine Learning Basics III

Machine Learning Basics III Benjamin Roth CIS LMU München Benjamin Roth (CIS LMU München) Machine Learning Basics III 1 / 62 Outline 1 Classification Logistic Regression 2 Gradient Based Optimization Gradient

Lecture 10. Neural networks and optimization. Machine Learning and Data Mining November Nando de Freitas UBC. Nonlinear Supervised Learning

Lecture 0 Neural networks and optimization Machine Learning and Data Mining November 2009 UBC Gradient Searching for a good solution can be interpreted as looking for a minimum of some error (loss) function

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Multilayer Neural Networks (sometimes called Multilayer Perceptrons or MLPs) Linear separability Hyperplane In 2D: w x + w 2 x 2 + w 0 = 0 Feature x 2 = w w 2 x w 0 w 2 Feature 2 A perceptron can separate

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers

Engineering Part IIB: Module 4F0 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers Phil Woodland: pcw@eng.cam.ac.uk Michaelmas 202 Engineering Part IIB:

COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS16

COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS6 Lecture 3: Classification with Logistic Regression Advanced optimization techniques Underfitting & Overfitting Model selection (Training-

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur

Module 12 Machine Learning Lesson 39 Neural Networks - III 12.4.4 Multi-Layer Perceptrons In contrast to perceptrons, multilayer networks can learn not only multiple decision boundaries, but the boundaries

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

Artificial Neuron (Perceptron)

9/6/208 Gradient Descent (GD) Hantao Zhang Deep Learning with Python Reading: https://en.wikipedia.org/wiki/gradient_descent Artificial Neuron (Perceptron) = w T = w 0 0 + + w 2 2 + + w d d where

Statistical Machine Learning from Data

January 17, 2006 Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Multi-Layer Perceptrons Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole

Day 3 Lecture 3. Optimizing deep networks

Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient

What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff

Iterative Reweighted Least Squares

Iterative Reweighted Least Squares Sargur. University at Buffalo, State University of ew York USA Topics in Linear Classification using Probabilistic Discriminative Models Generative vs Discriminative

Case Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday!

Case Study 1: Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 4, 017 1 Announcements: