Generalization. A cat that once sat on a hot stove will never again sit on a hot stove or on a cold one either. Mark Twain


1 Generalization

2 Generalization A cat that once sat on a hot stove will never again sit on a hot stove or on a cold one either. Mark Twain

3 Generalization The network input-output mapping is accurate for the training data and for test data never seen before. The network interpolates well.

4 Cause of Overfitting Poor generalization is caused by using a network that is too complex (too many neurons/parameters). To have the best performance we need to find the least complex network that can represent the data (Ockham's Razor).

5 Ockham's Razor Find the simplest model that explains the data.

6 Problem Statement
Training set: $\{p_1, t_1\}, \{p_2, t_2\}, \ldots, \{p_Q, t_Q\}$
Underlying function: $t_q = g(p_q) + \varepsilon_q$
Performance function: $F(x) = E_D = \sum_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q)$

7 Poor Generalization
[Figure: overfitted network response, with the interpolation and extrapolation regions marked]

8 Good Generalization
[Figure: well-fitting network response, with the interpolation and extrapolation regions marked]

9 Extrapolation in 2-D

10 Measuring Generalization Test Set Part of the available data is set aside during the training process. After training, the network error on the test set is used as a measure of generalization ability. The test set must never be used in any way to train the network, or even to select one network from a group of candidate networks. The test set must be representative of all situations for which the network will be used.

11 Methods for Improving Generalization Pruning (removing neurons) until the performance is degraded. Growing (adding neurons) until the performance is adequate. Validation methods. Regularization.

12 Early Stopping Break up data into training, validation, and test sets. Use only the training set to compute gradients and determine weight updates. Compute the performance on the validation set at each iteration of training. Stop training when the performance on the validation set goes up for a specified number of iterations. Use the weights which achieved the lowest error on the validation set.
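A minimal NumPy sketch of this procedure is given below. The network itself is abstracted behind `grad` and `error` callables, and the data splits, learning rate and patience value are illustrative placeholders, not part of the original slides.

```python
import numpy as np

def train_with_early_stopping(grad, error, x0, train, val,
                              lr=0.01, patience=10, max_iter=10000):
    """Gradient descent with early stopping on a validation set.

    grad(x, data)  -> gradient of the training error at weights x
    error(x, data) -> scalar error of the network with weights x on a data set
    """
    x = x0.copy()
    best_x, best_val = x.copy(), error(x, val)
    since_best = 0                       # iterations since the validation error last improved
    for _ in range(max_iter):
        x = x - lr * grad(x, train)      # weight update uses only the training set
        val_err = error(x, val)          # monitor performance on the validation set
        if val_err < best_val:
            best_x, best_val, since_best = x.copy(), val_err, 0
        else:
            since_best += 1
            if since_best >= patience:   # validation error has gone up for `patience` steps
                break
    return best_x                        # weights with the lowest validation error
```

The test set stays untouched throughout; it is used only once at the end, to report the generalization error.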

13 Early Stopping Example
[Figure: F(x) for the training and validation sets versus iteration, with points a and b marked on each curve]

14 Regularization
Standard performance measure: $F = E_D$
Performance measure with regularization: $F = \beta E_D + \alpha E_W = \beta \sum_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q) + \alpha \sum_{i=1}^{n} x_i^2$
Complexity penalty: $E_W = \sum_{i=1}^{n} x_i^2$ (smaller weights mean a smoother function).
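The regularized index is straightforward to evaluate; a small sketch, with the weight vector, targets and outputs as placeholders:

```python
import numpy as np

def regularized_index(x, targets, outputs, alpha, beta):
    """F = beta * E_D + alpha * E_W: sum-of-squares error plus complexity penalty."""
    E_D = np.sum((targets - outputs) ** 2)   # sum of squared errors
    E_W = np.sum(x ** 2)                     # sum of squared weights
    return beta * E_D + alpha * E_W
```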

15 Effect of Weight Changes
[Figure: how the network response changes as individual weights ($w_{1,1}$, $w_{1,2}$, $w_{2,1}$, ...) are varied]

16 Effect of Regularization
[Figure: network fits for several values of the ratio $\alpha/\beta$, from 0 up; larger ratios give smoother responses]

17 Bayes Rule
$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
P(A) - Prior probability: what we know about A before B is known.
P(A|B) - Posterior probability: what we know about A after we know the outcome of B.
P(B|A) - Conditional probability (likelihood function): describes our knowledge of the system.
P(B) - Marginal probability: a normalization factor.

18 Example Problem 1% of the population have a certain disease. A test for the disease is 80% accurate in detecting the disease in people who have it. 10% of the time the test yields a false positive. If you have a positive test, what is your probability of having the disease?

19 Bayesian Analysis
$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
A - event that you have the disease. B - event that you have a positive test.
$P(A) = 0.01$, $P(B \mid A) = 0.8$
$P(B) = P(B \mid A)P(A) + P(B \mid \sim\!A)P(\sim\!A) = (0.8)(0.01) + (0.1)(0.99) = 0.107$
$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{(0.8)(0.01)}{0.107} \approx 0.075$
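The same arithmetic in Python, assuming the 1% prevalence and 10% false-positive rate restored above:

```python
P_A = 0.01                    # prior probability of having the disease
P_B_given_A = 0.80            # test detects the disease 80% of the time
P_B_given_notA = 0.10         # false-positive rate

# marginal probability of a positive test
P_B = P_B_given_A * P_A + P_B_given_notA * (1 - P_A)   # = 0.107

# posterior probability of disease given a positive test
P_A_given_B = P_B_given_A * P_A / P_B                  # ~= 0.075
print(P_A_given_B)
```

Even after a positive test, the probability of actually having the disease is only about 7.5%, because the prior probability is so small.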

20 Signal Plus Noise Example
$t = x + \varepsilon$
$f(\varepsilon) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\varepsilon^2}{2\sigma^2}\right)$
$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma_x}\exp\!\left(-\frac{x^2}{2\sigma_x^2}\right)$
$f(t \mid x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(t-x)^2}{2\sigma^2}\right)$
$f(x \mid t) = \frac{f(t \mid x)\,f(x)}{f(t)}$
[Figure: the densities $f(x \mid t)$, $f(t \mid x)$ and $f(x)$]

21 NN Bayesian Framework (MacKay, 1992)
$P(x \mid D, \alpha, \beta, M) = \frac{P(D \mid x, \beta, M)\,P(x \mid \alpha, M)}{P(D \mid \alpha, \beta, M)}$
Posterior = Likelihood x Prior / Normalization (Evidence)
D - data set, M - neural network model, x - vector of network weights.

22 Gaussian Assumptions
Gaussian noise: $P(D \mid x, \beta, M) = \frac{1}{Z_D(\beta)}\exp(-\beta E_D)$, with $Z_D(\beta) = (2\pi\sigma_\varepsilon^2)^{N/2} = (\pi/\beta)^{N/2}$
Gaussian prior: $P(x \mid \alpha, M) = \frac{1}{Z_W(\alpha)}\exp(-\alpha E_W)$, with $Z_W(\alpha) = (2\pi\sigma_w^2)^{n/2} = (\pi/\alpha)^{n/2}$
Posterior: $P(x \mid D, \alpha, \beta, M) = \frac{\frac{1}{Z_W(\alpha) Z_D(\beta)}\exp\!\left(-(\beta E_D + \alpha E_W)\right)}{\text{Normalization factor}} = \frac{\exp(-F(x))}{Z_F(\alpha,\beta)}$, with $F = \beta E_D + \alpha E_W$
Minimize F to maximize P.

23 Optimizing Regularization Parameters
Second level of inference:
$P(\alpha, \beta \mid D, M) = \frac{P(D \mid \alpha, \beta, M)\,P(\alpha, \beta \mid M)}{P(D \mid M)}$
Evidence (from the first level of inference):
$P(D \mid \alpha, \beta, M) = \frac{P(D \mid x, \beta, M)\,P(x \mid \alpha, M)}{P(x \mid D, \alpha, \beta, M)} = \frac{\left[\frac{\exp(-\beta E_D)}{Z_D(\beta)}\right]\left[\frac{\exp(-\alpha E_W)}{Z_W(\alpha)}\right]}{\frac{\exp(-F(x))}{Z_F(\alpha,\beta)}} = \frac{Z_F(\alpha,\beta)}{Z_D(\beta)\,Z_W(\alpha)} \cdot \frac{\exp(-\beta E_D - \alpha E_W)}{\exp(-F(x))} = \frac{Z_F(\alpha,\beta)}{Z_D(\beta)\,Z_W(\alpha)}$
$Z_F(\alpha,\beta)$ is the only unknown in this expression.

24 Quadratic Approximation
Taylor series expansion about the minimum point $x^{MP}$:
$F(x) \approx F(x^{MP}) + \frac{1}{2}(x - x^{MP})^T H^{MP}(x - x^{MP})$, where $H = \beta\nabla^2 E_D + \alpha\nabla^2 E_W$
Substituting into the previous posterior density function:
$P(x \mid D, \alpha, \beta, M) \approx \frac{1}{Z_F}\exp\!\left(-F(x^{MP}) - \tfrac{1}{2}(x - x^{MP})^T H^{MP}(x - x^{MP})\right) = \left\{\frac{1}{Z_F}\exp(-F(x^{MP}))\right\}\exp\!\left(-\tfrac{1}{2}(x - x^{MP})^T H^{MP}(x - x^{MP})\right)$
Equate with the standard Gaussian density:
$P(x) = \frac{1}{(2\pi)^{n/2}\,\left|(H^{MP})^{-1}\right|^{1/2}}\exp\!\left(-\tfrac{1}{2}(x - x^{MP})^T H^{MP}(x - x^{MP})\right)$
Comparing the two expressions, we have:
$Z_F(\alpha,\beta) \approx (2\pi)^{n/2}\left(\det\!\left((H^{MP})^{-1}\right)\right)^{1/2}\exp(-F(x^{MP}))$
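Numerically, this quadratic (Laplace) approximation gives $\log Z_F$ directly from $F(x^{MP})$ and the Hessian at the minimum; a sketch, assuming both are already available:

```python
import numpy as np

def log_ZF(F_min, H):
    """log Z_F(alpha, beta) under the quadratic approximation:
    Z_F ~ (2*pi)^(n/2) * det(H^{-1})^(1/2) * exp(-F(x_MP))."""
    n = H.shape[0]
    _, logdet_H = np.linalg.slogdet(H)     # log det(H); note det(H^{-1}) = 1/det(H)
    return 0.5 * n * np.log(2 * np.pi) - 0.5 * logdet_H - F_min
```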

25 Optimum Parameters
If we make this substitution for $Z_F$ in the expression for the evidence and then take derivatives with respect to $\alpha$ and $\beta$ to locate the optimum, we find:
$\alpha^{MP} = \frac{\gamma}{2 E_W(x^{MP})}$,  $\beta^{MP} = \frac{N - \gamma}{2 E_D(x^{MP})}$
Effective number of parameters: $\gamma = n - 2\alpha^{MP}\,\mathrm{tr}\!\left((H^{MP})^{-1}\right)$

26 Gauss-Newton Approximation
It can be expensive to compute the Hessian matrix. Try the Gauss-Newton approximation:
$H = \nabla^2 F(x) \approx 2\beta J^T J + 2\alpha I_n$
This is readily available if the Levenberg-Marquardt algorithm is used for training.

27 Algorithm (GNBR)
0. Initialize $\alpha$, $\beta$ and the weights.
1. Take one step of Levenberg-Marquardt to minimize $F(x) = \beta E_D + \alpha E_W$.
2. Compute the effective number of parameters $\gamma = n - 2\alpha\,\mathrm{tr}(H^{-1})$, using the Gauss-Newton approximation for H.
3. Compute new estimates of the regularization parameters $\alpha = \gamma / (2 E_W)$ and $\beta = (N - \gamma) / (2 E_D)$.
4. Iterate steps 1-3 until convergence.
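A schematic NumPy version of this loop is sketched below. The Levenberg-Marquardt step, the Jacobian and the error vector are abstracted as callables (placeholders for illustration); this is a sketch of the update logic, not the exact Foresee-Hagan implementation.

```python
import numpy as np

def gnbr(x, lm_step, jacobian, errors, N, max_iter=100, tol=1e-6):
    """Gauss-Newton approximation to Bayesian Regularization (sketch).

    lm_step(x, alpha, beta) -> weights after one Levenberg-Marquardt step on F
    jacobian(x)             -> Jacobian J of the error vector w.r.t. the weights
    errors(x)               -> error vector (t_q - a_q), length N
    """
    n = x.size
    alpha, beta = 0.0, 1.0                                   # step 0: initialize
    for _ in range(max_iter):
        x_new = lm_step(x, alpha, beta)                      # step 1: one LM step on F
        J = jacobian(x_new)
        H = 2.0 * beta * (J.T @ J) + 2.0 * alpha * np.eye(n)   # Gauss-Newton Hessian of F
        gamma = n - 2.0 * alpha * np.trace(np.linalg.inv(H))   # step 2: effective # of params
        E_W = np.sum(x_new ** 2)                             # assumes nonzero weights
        E_D = np.sum(errors(x_new) ** 2)
        alpha = gamma / (2.0 * E_W)                          # step 3: new regularization
        beta = (N - gamma) / (2.0 * E_D)                     #         parameter estimates
        if np.linalg.norm(x_new - x) < tol:                  # step 4: iterate to convergence
            x = x_new
            break
        x = x_new
    return x, alpha, beta, gamma
```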

28 Checks of Performance If γ is very close to n, then the network may be too small. Add more hidden layer neurons and retrain. If the larger network has the same final γ, then the smaller network was large enough. Otherwise, increase the number of hidden neurons. If a network is sufficiently large, then a larger network will achieve comparable values for γ, E D and E W. 28

29 GNBR Example
[Figure: network fit produced by GNBR, with the final values of $\alpha$ and $\beta$]

30 Convergence of GNBR
[Figure: $E_D$ on the training and testing sets, the effective number of parameters $\gamma$, and the ratio $\alpha/\beta$, each plotted versus iteration]

31 Relationship between Early Stopping and Regularization

32 Linear Network
[Network diagram: input p (R x 1), weight matrix W (S x R), bias b (S x 1), n = Wp + b, output a (S x 1)]
$a = \mathrm{purelin}(Wp + b) = Wp + b$
$a_i = \mathrm{purelin}(n_i) = \mathrm{purelin}({}_i w^T p + b_i) = {}_i w^T p + b_i$
${}_i w = \begin{bmatrix} w_{i,1} \\ w_{i,2} \\ \vdots \\ w_{i,R} \end{bmatrix}$ (the i-th row of W, written as a column vector)

33 Performance Index
Training set: $\{p_1, t_1\}, \{p_2, t_2\}, \ldots, \{p_Q, t_Q\}$ (input $p_q$, target $t_q$)
Notation: $x = \begin{bmatrix} w \\ b \end{bmatrix}$, $z = \begin{bmatrix} p \\ 1 \end{bmatrix}$, so that $a = w^T p + b = x^T z$
Mean square error: $F(x) = E[e^2] = E[(t - a)^2] = E[(t - x^T z)^2] = E_D$

34 Error Analysis
$F(x) = E[e^2] = E[(t - a)^2] = E[(t - x^T z)^2]$
$F(x) = E[t^2 - 2 t x^T z + x^T z z^T x]$
$F(x) = E[t^2] - 2 x^T E[t z] + x^T E[z z^T]\, x$
$F(x) = c - 2 x^T h + x^T R x$, where $c = E[t^2]$, $h = E[t z]$, $R = E[z z^T]$
The mean square error for the linear network is a quadratic function:
$F(x) = c + d^T x + \frac{1}{2} x^T A x$, with $d = -2h$ and $A = 2R$
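For a finite training set the expectations become probability-weighted sums; a small NumPy sketch (column q of `P` holds input $p_q$, `T[q]` the scalar target, and `prob` the occurrence probabilities, all placeholders):

```python
import numpy as np

def quadratic_mse_terms(P, T, prob=None):
    """Return c = E[t^2], h = E[t z], R = E[z z^T] so that
    F(x) = c - 2 x^T h + x^T R x, with z = [p; 1] and x = [w; b]."""
    Q = P.shape[1]
    prob = np.full(Q, 1.0 / Q) if prob is None else np.asarray(prob, float)
    Z = np.vstack([P, np.ones((1, Q))])       # augment inputs with the bias entry
    c = np.sum(prob * T ** 2)
    h = (Z * (prob * T)).sum(axis=1)
    R = (Z * prob) @ Z.T
    return c, h, R
```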

35 Example
[Network diagram: two-input neuron, a = purelin(Wp + b)]
Training pair $\{p_1, t_1\}$ occurs with probability 0.75; training pair $\{p_2, t_2\}$ occurs with probability 0.25.
$F(x) = c - 2 x^T h + x^T R x = E_D$
$c = E[t^2] = (t_1)^2 (0.75) + (t_2)^2 (0.25)$
$h = E[t z] = (0.75)\, t_1 z_1 + (0.25)\, t_2 z_2$
$R = E[z z^T] = z_1 z_1^T (0.75) + z_2 z_2^T (0.25)$

36 Performance Contour
Optimum point (maximum likelihood): $x^{ML} = -A^{-1} d = R^{-1} h$
Hessian matrix: $\nabla^2 F(x) = A = 2R = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$
Eigenvalues: $|A - \lambda I| = \begin{vmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{vmatrix} = \lambda^2 - 4\lambda + 3 = (\lambda - 1)(\lambda - 3)$, so $\lambda_1 = 1$, $\lambda_2 = 3$
Eigenvectors: from $(A - \lambda_i I)v_i = 0$, $v_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$ and $v_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$
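The eigen-analysis can be checked directly with NumPy, using the Hessian values reconstructed above:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # Hessian of F(x), A = 2R
lam, V = np.linalg.eigh(A)
print(lam)                          # eigenvalues: [1. 3.]
print(V)                            # eigenvectors along the [1, -1] and [1, 1] directions
```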

37 Contour Plot of E_D
[Figure: contours of $E_D$ with the minimum at $x^{ML}$]
$\gamma = n - 2\alpha^{MP}\,\mathrm{tr}\!\left((H^{MP})^{-1}\right)$

38 Steepest Descent Trajectory
(Here $\alpha$ denotes the learning rate.)
$x_{k+1} = x_k - \alpha g_k = x_k - \alpha(A x_k + d) = x_k - \alpha A(x_k + A^{-1} d) = x_k - \alpha A(x_k - x^{ML})$
$\phantom{x_{k+1}} = [I - \alpha A]\, x_k + \alpha A\, x^{ML} = M x_k + [I - M]\, x^{ML}$, where $M = [I - \alpha A]$
$x_1 = M x_0 + [I - M]\, x^{ML}$
$x_2 = M x_1 + [I - M]\, x^{ML} = M^2 x_0 + M[I - M]\, x^{ML} + [I - M]\, x^{ML}$
$\phantom{x_2} = M^2 x_0 + M x^{ML} - M^2 x^{ML} + x^{ML} - M x^{ML} = M^2 x_0 + [I - M^2]\, x^{ML}$
$\vdots$
$x_k = M^k x_0 + [I - M^k]\, x^{ML}$
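A quick numerical check of this closed form, for any positive-definite A and a small enough learning rate (the quadratic is $F(x) = c + d^T x + \tfrac{1}{2}x^T A x$, so $g_k = A x_k + d$; the particular A, d, x0 values below are assumed for illustration):

```python
import numpy as np

def check_sd_trajectory(A, d, x0, lr, k):
    """Verify x_k = M^k x_0 + (I - M^k) x_ML for steepest descent on a quadratic."""
    n = len(x0)
    x_ml = np.linalg.solve(A, -d)                    # minimum point: A x + d = 0
    M = np.eye(n) - lr * A
    x = x0.astype(float).copy()
    for _ in range(k):                               # explicit steepest-descent iteration
        x = x - lr * (A @ x + d)
    Mk = np.linalg.matrix_power(M, k)
    x_closed = Mk @ x0 + (np.eye(n) - Mk) @ x_ml     # closed-form expression from the slide
    return np.allclose(x, x_closed)

A = np.array([[2.0, 1.0], [1.0, 2.0]])
print(check_sd_trajectory(A, d=np.array([-2.0, -1.0]), x0=np.zeros(2), lr=0.1, k=50))  # True
```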

39 Regularization
$F(x) = E_D + \rho E_W$ (with $\rho = \alpha/\beta$), where $E_W = \frac{1}{2}(x - x_0)^T (x - x_0)$
Gradients: $\nabla E_D = A(x - x^{ML})$, $\nabla E_W = (x - x_0)$
To locate the minimum point, set the gradient to zero:
$\nabla F(x) = \nabla E_D + \rho\, \nabla E_W = A(x - x^{ML}) + \rho(x - x_0) = 0$

40
$A(x^{MAP} - x^{ML}) = -\rho(x^{MAP} - x_0) = -\rho(x^{MAP} - x^{ML} + x^{ML} - x_0) = -\rho(x^{MAP} - x^{ML}) - \rho(x^{ML} - x_0)$
$(A + \rho I)(x^{MAP} - x^{ML}) = \rho(x_0 - x^{ML})$
$x^{MAP} - x^{ML} = \rho(A + \rho I)^{-1}(x_0 - x^{ML})$
$x^{MAP} = x^{ML} + \rho(A + \rho I)^{-1} x_0 - \rho(A + \rho I)^{-1} x^{ML} = M_\rho x_0 + [I - M_\rho]\, x^{ML}$, where $M_\rho = \rho(A + \rho I)^{-1}$

41 Early Stopping vs. Regularization
Early stopping: $x_k = M^k x_0 + [I - M^k]\, x^{ML}$, with $M = [I - \alpha A]$
Regularization: $x^{MAP} = M_\rho x_0 + [I - M_\rho]\, x^{ML}$, with $M_\rho = \rho(A + \rho I)^{-1}$
Eigenvalues of $M^k$: if $z_i$ is an eigenvector of A with eigenvalue $\lambda_i$, then
$[I - \alpha A]\, z_i = z_i - \alpha A z_i = z_i - \alpha\lambda_i z_i = (1 - \alpha\lambda_i)\, z_i$, so $\mathrm{eig}(M^k) = (1 - \alpha\lambda_i)^k$
Eigenvalues of $M_\rho$: $\mathrm{eig}(M_\rho) = \frac{\rho}{\lambda_i + \rho}$

42 Reg. Parameter vs. Iteration Number
$M^k$ and $M_\rho$ have the same eigenvectors. They would be equal if their eigenvalues were equal:
$\frac{\rho}{\lambda_i + \rho} = (1 - \alpha\lambda_i)^k$
Taking logarithms: $-\log\!\left(1 + \frac{\lambda_i}{\rho}\right) = k\log(1 - \alpha\lambda_i)$
These are equal at $\lambda_i = 0$; they remain approximately equal if the slopes (derivatives with respect to $\lambda_i$) are equal:
$\frac{1}{\rho + \lambda_i} = \frac{\alpha k}{1 - \alpha\lambda_i}$, i.e., $\alpha k = \frac{1}{\rho}\cdot\frac{(1 - \alpha\lambda_i)}{(1 + \lambda_i/\rho)}$
If $\alpha\lambda_i$ and $\lambda_i/\rho$ are small, then $\alpha k \approx \frac{1}{\rho}$.
(Increasing the number of iterations is equivalent to decreasing the regularization parameter!)
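A small numeric illustration of the correspondence $\alpha k \approx 1/\rho$ (the eigenvalues, learning rate and iteration count below are arbitrary illustrative values):

```python
import numpy as np

lam = np.array([0.02, 0.05, 0.1])   # assumed eigenvalues of A (small relative to 1/alpha)
lr = 0.01                           # learning rate (alpha in the steepest-descent slides)
k = 500                             # number of steepest-descent iterations
rho = 1.0 / (lr * k)                # equivalent regularization parameter, rho = 1/(alpha*k)

eig_Mk = (1.0 - lr * lam) ** k      # eigenvalues of M^k   (early stopping)
eig_Mrho = rho / (lam + rho)        # eigenvalues of M_rho (regularization)
print(eig_Mk)                       # approximately [0.905 0.779 0.606]
print(eig_Mrho)                     # approximately [0.909 0.8   0.667]
```

The agreement is closest for the smallest eigenvalues, exactly as the derivation requires ($\alpha\lambda_i$ and $\lambda_i/\rho$ small).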

43 Example
[Network diagram: two-input neuron, a = purelin(Wp + b); training pairs $\{p_1, t_1\}$ with probability 0.75 and $\{p_2, t_2\}$ with probability 0.25]
$F(x) = E_D + \rho E_W$
$E_D = c + x^T d + \frac{1}{2} x^T A x$, $E_W = \frac{1}{2} x^T x$, with $d = -2h$, $A = 2R$
$\nabla^2 F(x) = \nabla^2 E_D + \rho\, \nabla^2 E_W = A + \rho I = \begin{bmatrix} 2 + \rho & 1 \\ 1 & 2 + \rho \end{bmatrix}$

44 [Figure: contour plots of $F(x) = E_D + \rho E_W$ for several values of ρ, including ρ = 0 and ρ = 2]

45 [Figure: movement of the minimum point $x^{MAP}$ as ρ is varied]

46 Steepest Descent Path

47 Effective Number of Parameters
$\gamma = n - 2\alpha^{MP}\,\mathrm{tr}\!\left((H^{MP})^{-1}\right)$
$H(x) = \nabla^2 F(x) = \beta\,\nabla^2 E_D + \alpha\,\nabla^2 E_W = \beta\,\nabla^2 E_D + 2\alpha I$
If $\lambda_i$ are the eigenvalues of $\nabla^2 E_D$, then $\mathrm{tr}\{H^{-1}\} = \sum_{i=1}^{n}\frac{1}{\beta\lambda_i + 2\alpha}$
$\gamma = n - 2\alpha^{MP}\,\mathrm{tr}\!\left((H^{MP})^{-1}\right) = n - \sum_{i=1}^{n}\frac{2\alpha}{\beta\lambda_i + 2\alpha} = \sum_{i=1}^{n}\frac{\beta\lambda_i}{\beta\lambda_i + 2\alpha} = \sum_{i=1}^{n}\gamma_i$, with $\gamma_i = \frac{\beta\lambda_i}{\beta\lambda_i + 2\alpha}$
The effective number of parameters will equal the number of large eigenvalues of the Hessian.
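Given the Gauss-Newton approximation $\nabla^2 E_D \approx 2 J^T J$ from earlier, γ can be computed directly from the eigenvalues; a sketch:

```python
import numpy as np

def effective_num_parameters(J, alpha, beta):
    """gamma = sum_i beta*lam_i / (beta*lam_i + 2*alpha), where lam_i are the
    eigenvalues of grad^2 E_D, approximated here by 2 J^T J (Gauss-Newton)."""
    lam = np.linalg.eigvalsh(2.0 * (J.T @ J))        # eigenvalues of grad^2 E_D
    return np.sum(beta * lam / (beta * lam + 2.0 * alpha))
```

Each term $\gamma_i$ lies between 0 and 1, so γ counts, roughly, how many eigen-directions of the Hessian are constrained by the data rather than by the prior.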
