Generalization. "A cat that once sat on a hot stove will never again sit on a hot stove or on a cold one either." (Mark Twain)
1 Generalization
2 Generalization. "A cat that once sat on a hot stove will never again sit on a hot stove or on a cold one either." (Mark Twain)
3 Generalization. The network input-output mapping is accurate for the training data and for test data never seen before: the network interpolates well.
4 Cause of Overfitting. Poor generalization is caused by using a network that is too complex (too many neurons/parameters). To get the best performance we need to find the least complex network that can represent the data (Ockham's Razor).
5 Ockham's Razor. Find the simplest model that explains the data.
6 Problem Statement. Training set: $\{p_1, t_1\}, \{p_2, t_2\}, \dots, \{p_Q, t_Q\}$. Underlying function: $t_q = g(p_q) + \varepsilon_q$. Performance function: $F(x) = E_D = \sum_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q)$.
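The performance function is just the sum of squared errors over the training set; a minimal numerical sketch (hypothetical targets and outputs, numpy assumed):

```python
import numpy as np

# Hypothetical targets t_q and network outputs a_q (Q = 3 samples, 2 outputs each)
t = np.array([[1.0, 0.0], [0.5, -0.5], [0.0, 1.0]])
a = np.array([[0.9, 0.1], [0.4, -0.6], [0.2, 0.8]])

# F(x) = E_D = sum over q of (t_q - a_q)^T (t_q - a_q)
E_D = np.sum((t - a) ** 2)
print(E_D)  # 0.12
```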
7 Poor Generalization. (figure: an overfit network response, with the interpolation and extrapolation regions marked)
8 Good Generalization. (figure: a well-generalizing network response, with the interpolation and extrapolation regions marked)
9 Extrapolation in 2-D. (figure)
10 Measuring Generalization: the Test Set. Part of the available data is set aside during the training process. After training, the network error on the test set is used as a measure of generalization ability. The test set must never be used in any way to train the network, or even to select one network from a group of candidate networks. The test set must be representative of all situations for which the network will be used.
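A sketch of such a data split (hypothetical data and proportions; numpy assumed), with the test indices held out from both training and model selection:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # hypothetical inputs
y = rng.normal(size=(100, 1))   # hypothetical targets

# Shuffle once, then split 70/15/15 into training / validation / test
idx = rng.permutation(len(X))
train, val, test = idx[:70], idx[70:85], idx[85:]
X_train, y_train = X[train], y[train]
X_val, y_val = X[val], y[val]
X_test, y_test = X[test], y[test]   # used once, after all training is finished
```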
11 Methods for Improving Generalization. Pruning (removing neurons until the performance is degraded). Growing (adding neurons until the performance is adequate). Validation methods. Regularization.
12 Early Stopping. Break up the data into training, validation, and test sets. Use only the training set to compute gradients and determine weight updates. Compute the performance on the validation set at each iteration of training. Stop training when the performance on the validation set goes up for a specified number of iterations. Use the weights which achieved the lowest error on the validation set.
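A minimal sketch of this loop (the `train_step` and `val_loss` callables are hypothetical placeholders for a real training step and validation-set evaluation):

```python
import copy

def early_stopping_train(model, train_step, val_loss, max_iters=1000, patience=20):
    """Stop when validation loss has not improved for `patience` iterations;
    return the weights that achieved the lowest validation error."""
    best_loss = float("inf")
    best_model = copy.deepcopy(model)
    bad_iters = 0
    for _ in range(max_iters):
        train_step(model)       # gradient step computed from the training set only
        loss = val_loss(model)  # performance on the validation set
        if loss < best_loss:
            best_loss, best_model, bad_iters = loss, copy.deepcopy(model), 0
        else:
            bad_iters += 1
            if bad_iters >= patience:  # validation error rose too many times
                break
    return best_model
```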
13 Early Stopping Example. (figure: F(x) curves for the training and validation sets, with points a and b marked on each)
14 Regularization. Standard performance measure: $F = E_D$. Performance measure with regularization: $F = \beta E_D + \alpha E_W = \beta \sum_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q) + \alpha \sum_{i=1}^{n} x_i^2$, where $E_W = \sum_{i=1}^{n} x_i^2$ is the complexity penalty. (Smaller weights mean a smoother function.)
15 Effect of Weight Changes. (figure: the network response as individual weights such as $w_{1,1}$ and $w_{1,2}$ are varied)
16 Effect of Regularization. (figure: fits for increasing ratios $\alpha/\beta$ = 0, 0.01, 0.25, 1; larger ratios give smoother fits)
17 Bayes' Rule: $P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$. $P(A)$: prior probability, what we know about A before B is known. $P(A|B)$: posterior probability, what we know about A after we know the outcome of B. $P(B|A)$: conditional probability (likelihood function), which describes our knowledge of the system. $P(B)$: marginal probability, a normalization factor.
18 Example Problem. 1% of the population have a certain disease. A test for the disease is 80% accurate in detecting the disease in people who have it. 10% of the time the test yields a false positive. If you have a positive test, what is your probability of having the disease?
19 Bayesian Analysis. $P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$. A: event that you have the disease. B: event that you have a positive test. $P(A) = 0.01$, $P(B|A) = 0.8$, $P(B|\lnot A) = 0.1$. $P(B) = P(B|A)P(A) + P(B|\lnot A)P(\lnot A) = (0.8)(0.01) + (0.1)(0.99) = 0.107$. $P(A|B) = \frac{(0.8)(0.01)}{0.107} \approx 0.075$.
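Checking the arithmetic directly (using the values as restored above):

```python
p_A = 0.01            # prior: 1% of the population has the disease
p_B_given_A = 0.8     # the test detects the disease 80% of the time
p_B_given_notA = 0.1  # 10% false-positive rate

# Marginal probability of a positive test, then Bayes' rule
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)  # 0.107
p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)  # ~0.0748: a positive test means only about a 7.5% chance
```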
20 Signal Plus Noise Example. $t = x + \varepsilon$, with $f(\varepsilon) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{\varepsilon^2}{2\sigma^2}\right)$ and $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma_x} \exp\!\left(-\frac{x^2}{2\sigma_x^2}\right)$. Then $f(t|x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(t-x)^2}{2\sigma^2}\right)$ and $f(x|t) = \frac{f(t|x)\,f(x)}{f(t)}$. (figure: the prior $f(x)$, the likelihood $f(t|x)$, and the resulting posterior $f(x|t)$)
21 NN Bayesian Framework (MacKay, 1992). Posterior = (Likelihood × Prior) / Evidence: $P(x|D,\alpha,\beta,M) = \frac{P(D|x,\beta,M)\,P(x|\alpha,M)}{P(D|\alpha,\beta,M)}$. D: data set. M: neural network model. x: vector of network weights. The weights that maximize the posterior are denoted $x^{MP}$; those that maximize the likelihood, $x^{ML}$.
22 Gaussian Assumptions. Gaussian noise: $P(D|x,\beta,M) = \frac{1}{Z_D(\beta)} \exp(-\beta E_D)$, with $Z_D(\beta) = (2\pi\sigma_\varepsilon^2)^{N/2} = (\pi/\beta)^{N/2}$. Gaussian prior: $P(x|\alpha,M) = \frac{1}{Z_W(\alpha)} \exp(-\alpha E_W)$, with $Z_W(\alpha) = (2\pi\sigma_w^2)^{n/2} = (\pi/\alpha)^{n/2}$. Substituting into Bayes' rule, the posterior is $P(x|D,\alpha,\beta,M) = \frac{\frac{1}{Z_W(\alpha) Z_D(\beta)} \exp(-(\beta E_D + \alpha E_W))}{\text{normalization factor}} = \frac{\exp(-F(x))}{Z_F(\alpha,\beta)}$, where $F = \beta E_D + \alpha E_W$. Minimize F to maximize P.
23 Optimizing Regularization Parameters. Second level of inference: $P(\alpha,\beta|D,M) = \frac{P(D|\alpha,\beta,M)\,P(\alpha,\beta|M)}{P(D|M)}$. The evidence from the first level is $P(D|\alpha,\beta,M) = \frac{P(D|x,\beta,M)\,P(x|\alpha,M)}{P(x|D,\alpha,\beta,M)} = \frac{\left[\frac{\exp(-\beta E_D)}{Z_D(\beta)}\right]\left[\frac{\exp(-\alpha E_W)}{Z_W(\alpha)}\right]}{\frac{\exp(-F(x))}{Z_F(\alpha,\beta)}} = \frac{Z_F(\alpha,\beta)}{Z_D(\beta)\,Z_W(\alpha)} \cdot \frac{\exp(-\beta E_D - \alpha E_W)}{\exp(-F(x))} = \frac{Z_F(\alpha,\beta)}{Z_D(\beta)\,Z_W(\alpha)}$. $Z_F(\alpha,\beta)$ is the only unknown in this expression.
24 Quadratic Approximation. Taylor series expansion of F about the most probable weights: $F(x) \approx F(x^{MP}) + \frac{1}{2}(x - x^{MP})^T H^{MP} (x - x^{MP})$, where $H = \beta \nabla^2 E_D + \alpha \nabla^2 E_W$. Substituting into the posterior density: $P(x|D,\alpha,\beta,M) \approx \frac{1}{Z_F} \exp(-F(x^{MP})) \exp\!\left(-\frac{1}{2}(x - x^{MP})^T H^{MP} (x - x^{MP})\right)$. Equating this with the standard Gaussian density $\frac{1}{(2\pi)^{n/2} \left(\det\left((H^{MP})^{-1}\right)\right)^{1/2}} \exp\!\left(-\frac{1}{2}(x - x^{MP})^T H^{MP} (x - x^{MP})\right)$ gives $Z_F(\alpha,\beta) \approx (2\pi)^{n/2} \left(\det\left((H^{MP})^{-1}\right)\right)^{1/2} \exp(-F(x^{MP}))$.
25 Optimum Parameters. If we make this substitution for $Z_F$ in the expression for the evidence and then take the derivatives with respect to $\alpha$ and $\beta$ to locate the optimum, we find: $\alpha^{MP} = \frac{\gamma}{2 E_W(x^{MP})}$, $\beta^{MP} = \frac{N - \gamma}{2 E_D(x^{MP})}$, where the effective number of parameters is $\gamma = n - 2\alpha^{MP}\,\mathrm{tr}\!\left((H^{MP})^{-1}\right)$.
26 Gauss-Newton Approximation. It can be expensive to compute the Hessian matrix; try the Gauss-Newton approximation $H = \nabla^2 F(x) \approx 2\beta J^T J + 2\alpha I_n$, where J is the Jacobian of the network errors. This is readily available if the Levenberg-Marquardt algorithm is used for training.
27 Algorithm (GNBR).
0. Initialize α, β and the weights.
1. Take one step of Levenberg-Marquardt to minimize F(w).
2. Compute the effective number of parameters $\gamma = n - 2\alpha\,\mathrm{tr}(H^{-1})$, using the Gauss-Newton approximation for H.
3. Compute new estimates of the regularization parameters: $\alpha = \gamma / (2 E_W)$ and $\beta = (N - \gamma) / (2 E_D)$.
4. Iterate steps 1-3 until convergence.
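A runnable sketch of this loop for a model that is linear in its weights, so that one Levenberg-Marquardt step reduces to a single regularized Gauss-Newton solve; the data are synthetic and illustrative, not the lecture's example:

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(50, 4))                 # hypothetical inputs: N = 50, n = 4 weights
w_true = np.array([1.0, -2.0, 0.5, 0.0])
t = Z @ w_true + 0.1 * rng.normal(size=50)   # noisy targets

N, n = Z.shape
alpha, beta = 0.01, 1.0                      # step 0: initialize alpha, beta
w = np.zeros(n)                              # ... and the weights

for _ in range(50):
    e = t - Z @ w                            # errors; their Jacobian is J = -Z
    J = -Z
    grad = 2 * beta * (J.T @ e) + 2 * alpha * w
    H = 2 * beta * (J.T @ J) + 2 * alpha * np.eye(n)  # Gauss-Newton Hessian of F
    w = w - np.linalg.solve(H, grad)         # step 1: one Gauss-Newton/LM step
    e = t - Z @ w
    E_D, E_W = e @ e, w @ w
    gamma = n - 2 * alpha * np.trace(np.linalg.inv(H))  # step 2
    alpha = gamma / (2 * E_W)                # step 3: re-estimate the
    beta = (N - gamma) / (2 * E_D)           # regularization parameters

print(gamma, alpha / beta)                   # gamma near the number of useful weights
```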
28 Checks of Performance. If γ is very close to n, then the network may be too small: add more hidden-layer neurons and retrain. If the larger network has the same final γ, then the smaller network was large enough; otherwise, increase the number of hidden neurons. If a network is sufficiently large, then a larger network will achieve comparable values for γ, $E_D$ and $E_W$.
29 GNBR Example. (figure: the fit produced by GNBR, annotated with α and β)
30 Convergence of GNBR. (figure: four panels versus iteration: training $E_D$, testing $E_D$, γ, and α/β)
31 Relationship between Early Stopping and Regularization
32 Linear Network. (figure: a single-layer linear network: input $p \in \mathbb{R}^R$, weights W ($S \times R$), bias b, and output $a = \mathrm{purelin}(Wp + b) = Wp + b$.) For one neuron, $a_i = \mathrm{purelin}(n_i) = \mathrm{purelin}(\mathbf{w}_i^T p + b_i) = \mathbf{w}_i^T p + b_i$, where $\mathbf{w}_i = [w_{i,1}\; w_{i,2}\; \cdots\; w_{i,R}]^T$ is the i-th row of W.
33 Performance Index. Training set: $\{p_1, t_1\}, \{p_2, t_2\}, \dots, \{p_Q, t_Q\}$, with input $p_q$ and target $t_q$. Notation: $x = \begin{bmatrix} \mathbf{w} \\ b \end{bmatrix}$, $z = \begin{bmatrix} p \\ 1 \end{bmatrix}$, so that $a = \mathbf{w}^T p + b = x^T z$. Mean square error: $F(x) = E[e^2] = E[(t - a)^2] = E[(t - x^T z)^2] = E_D$.
34 Error Analysis. $F(x) = E[e^2] = E[(t - a)^2] = E[(t - x^T z)^2] = E[t^2 - 2 t x^T z + x^T z z^T x] = E[t^2] - 2 x^T E[tz] + x^T E[z z^T]\, x$. Hence $F(x) = c - 2 x^T h + x^T R x$, with $c = E[t^2]$, $h = E[tz]$, $R = E[z z^T]$. The mean square error for the linear network is therefore a quadratic function: $F(x) = c + d^T x + \frac{1}{2} x^T A x$, where $d = -2h$ and $A = 2R$.
35 Example: Two-Input Neuron. (figure: inputs $p_1$, $p_2$ with weights $w_{1,1}$, $w_{1,2}$ and bias b; $a = \mathrm{purelin}(Wp + b)$.) Training pairs: $\left\{p_1 = \begin{bmatrix}1\\1\end{bmatrix},\, t_1 = 1\right\}$ with probability 0.75, and $\left\{p_2 = \begin{bmatrix}-1\\1\end{bmatrix},\, t_2 = -1\right\}$ with probability 0.25. Then $F(x) = c - 2 x^T h + x^T R x = E_D$, with $c = E[t^2] = (1)^2(0.75) + (-1)^2(0.25) = 1$, $h = E[tz] = (0.75)(1)\begin{bmatrix}1\\1\end{bmatrix} + (0.25)(-1)\begin{bmatrix}-1\\1\end{bmatrix} = \begin{bmatrix}1\\0.5\end{bmatrix}$, and $R = E[zz^T] = p_1 p_1^T (0.75) + p_2 p_2^T (0.25) = \begin{bmatrix}1 & 0.5\\0.5 & 1\end{bmatrix}$ (here z is the input p itself).
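These statistics, and the optimum point used on the next slide, can be verified numerically (values as reconstructed above):

```python
import numpy as np

# Input/target pairs and their probabilities, as reconstructed above
p1, t1, pr1 = np.array([1.0, 1.0]), 1.0, 0.75
p2, t2, pr2 = np.array([-1.0, 1.0]), -1.0, 0.25

c = pr1 * t1**2 + pr2 * t2**2                        # E[t^2] = 1
h = pr1 * t1 * p1 + pr2 * t2 * p2                    # E[tz]  = [1, 0.5]
R = pr1 * np.outer(p1, p1) + pr2 * np.outer(p2, p2)  # E[zz^T] = [[1, .5], [.5, 1]]
x_ML = np.linalg.solve(R, h)                         # minimizer of F: [1, 0]
print(c, h, R, x_ML)
```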
36 Performance Contour. Optimum point (maximum likelihood): $x^{ML} = -A^{-1} d = R^{-1} h = \begin{bmatrix}1\\0\end{bmatrix}$. Hessian matrix: $\nabla^2 F(x) = A = 2R = \begin{bmatrix}2 & 1\\1 & 2\end{bmatrix}$. Eigenvalues: $|A - \lambda I| = \begin{vmatrix}2-\lambda & 1\\1 & 2-\lambda\end{vmatrix} = \lambda^2 - 4\lambda + 3 = (\lambda - 1)(\lambda - 3)$, so $\lambda_1 = 1$, $\lambda_2 = 3$. Eigenvectors, from $(A - \lambda I)v = 0$: $v_1 = \begin{bmatrix}1\\-1\end{bmatrix}$ and $v_2 = \begin{bmatrix}1\\1\end{bmatrix}$.
37 Contour Plot of $E_D$. (figure: elliptical contours of $E_D$ centered at $x^{ML}$)
38 Steepest Descent Trajectory. $x_{k+1} = x_k - \alpha g_k = x_k - \alpha(Ax_k + d) = x_k - \alpha A(x_k + A^{-1}d) = x_k - \alpha A(x_k - x^{ML}) = [I - \alpha A]\,x_k + \alpha A\, x^{ML} = M x_k + [I - M]\,x^{ML}$, where $M = [I - \alpha A]$ and α here is the learning rate. Then $x_1 = M x_0 + [I - M] x^{ML}$ and $x_2 = M x_1 + [I - M] x^{ML} = M^2 x_0 + M[I - M] x^{ML} + [I - M] x^{ML} = M^2 x_0 + [I - M^2] x^{ML}$; in general, $x_k = M^k x_0 + [I - M^k]\, x^{ML}$.
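The closed form can be checked against the explicit iteration (A and $x^{ML}$ from this example; the learning rate is an arbitrary illustrative choice):

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
x_ML = np.array([1.0, 0.0])
lr = 0.1                            # learning rate (alpha in this slide's notation)
M = np.eye(2) - lr * A

x = np.zeros(2)                     # start from x_0 = 0
for _ in range(7):                  # seven explicit steepest-descent steps
    x = M @ x + (np.eye(2) - M) @ x_ML

Mk = np.linalg.matrix_power(M, 7)   # closed form: x_k = M^k x_0 + (I - M^k) x_ML
closed = (np.eye(2) - Mk) @ x_ML    # the M^k x_0 term vanishes for x_0 = 0
print(np.allclose(x, closed))       # True
```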
39 Regularization. $F(x) = E_D + \rho E_W$ (with $\rho = \alpha/\beta$), where $E_W = \frac{1}{2}(x - x_0)^T(x - x_0)$. The gradients are $\nabla E_D = A(x - x^{ML})$ and $\nabla E_W = (x - x_0)$. To locate the minimum point, set the gradient to zero: $\nabla F(x) = \nabla E_D + \rho\,\nabla E_W = A(x - x^{ML}) + \rho(x - x_0) = 0$.
40 MAP Solution. Starting from $A(x^{MAP} - x^{ML}) + \rho(x^{MAP} - x_0) = 0$ and writing $x^{MAP} - x_0 = (x^{MAP} - x^{ML}) + (x^{ML} - x_0)$, we get $(A + \rho I)(x^{MAP} - x^{ML}) = \rho(x_0 - x^{ML})$, so $x^{MAP} - x^{ML} = \rho(A + \rho I)^{-1}(x_0 - x^{ML})$ and therefore $x^{MAP} = M_\rho x_0 + [I - M_\rho]\,x^{ML}$, where $M_\rho = \rho(A + \rho I)^{-1}$.
41 Early Stopping vs. Regularization. Early stopping: $x_k = M^k x_0 + [I - M^k] x^{ML}$, $M = [I - \alpha A]$. Regularization: $x^{MAP} = M_\rho x_0 + [I - M_\rho] x^{ML}$, $M_\rho = \rho(A + \rho I)^{-1}$. Eigenvalues of $M^k$: with $z_i$, $\lambda_i$ the eigenvectors and eigenvalues of A, $[I - \alpha A] z_i = z_i - \alpha A z_i = z_i - \alpha\lambda_i z_i = (1 - \alpha\lambda_i) z_i$, so $\mathrm{eig}(M^k) = (1 - \alpha\lambda_i)^k$. Eigenvalues of $M_\rho$: $\mathrm{eig}(M_\rho) = \frac{\rho}{\rho + \lambda_i}$.
42 Regularization Parameter vs. Iteration Number. $M^k$ and $M_\rho$ have the same eigenvectors; they would be equal if their eigenvalues were equal: $\frac{\rho}{\rho + \lambda_i} = (1 - \alpha\lambda_i)^k$. Taking the log: $-\log\!\left(1 + \frac{\lambda_i}{\rho}\right) = k \log(1 - \alpha\lambda_i)$. Since both sides are equal at $\lambda_i = 0$, they remain approximately equal if their slopes with respect to $\lambda_i$ are equal: $\frac{1/\rho}{1 + \lambda_i/\rho} = \frac{k\alpha}{1 - \alpha\lambda_i}$. If $\alpha\lambda_i$ and $\lambda_i/\rho$ are small, this gives $\alpha k \approx \frac{1}{\rho}$. (Increasing the number of iterations is equivalent to decreasing the regularization parameter!)
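A quick numerical check of this match (illustrative values for the learning rate, iteration count, and eigenvalues):

```python
import numpy as np

lr, k = 0.01, 100             # learning rate and number of iterations
rho = 1.0 / (lr * k)          # predicted equivalent regularization parameter: 1.0
lam = np.array([0.05, 0.2])   # small eigenvalues of A

early_stop = (1 - lr * lam) ** k   # eigenvalues of M^k
regularized = rho / (rho + lam)    # eigenvalues of M_rho
print(early_stop)   # [0.951, 0.819]
print(regularized)  # [0.952, 0.833] -- close for small lambda_i
```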
43 Example. Same two-input neuron and training pairs as before, now with $F(x) = E_D + \rho E_W$, where $E_D = c + d^T x + \frac{1}{2} x^T A x$ ($d = -2h$, $A = 2R$) and $E_W = \frac{1}{2} x^T x$. Then $\nabla^2 F(x) = \nabla^2 E_D + \rho\,\nabla^2 E_W = A + \rho I = \begin{bmatrix}2+\rho & 1\\1 & 2+\rho\end{bmatrix}$.
44 (figure: contours and minima of F(x) for several values of ρ; as ρ grows, the minimum moves from $x^{ML}$ toward $x_0 = 0$)
45 (figure: the trajectory of $x^{MAP}$ as ρ varies)
46 Steepest Descent Path. (figure: the steepest descent trajectory from $x_0$, for comparison with the regularization trajectory)
47 Effective Number of Parameters. $\gamma = n - 2\alpha^{MP}\,\mathrm{tr}\!\left((H^{MP})^{-1}\right)$, where $H = \nabla^2 F(x) = \beta \nabla^2 E_D + \alpha \nabla^2 E_W = \beta \nabla^2 E_D + 2\alpha I$. If $\lambda_i$ are the eigenvalues of $\nabla^2 E_D$ (so that $\beta\lambda_i + 2\alpha$ are the eigenvalues of H), then $\mathrm{tr}(H^{-1}) = \sum_{i=1}^{n} \frac{1}{\beta\lambda_i + 2\alpha}$ and $\gamma = n - \sum_{i=1}^{n} \frac{2\alpha}{\beta\lambda_i + 2\alpha} = \sum_{i=1}^{n} \frac{\beta\lambda_i}{\beta\lambda_i + 2\alpha}$. Each term lies between 0 and 1, so the effective number of parameters equals the number of large eigenvalues of the Hessian (those with $\beta\lambda_i \gg 2\alpha$).
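Computing γ both ways for the running example (the α and β values are illustrative):

```python
import numpy as np

beta, alpha = 1.0, 0.1
H_D = np.array([[2.0, 1.0], [1.0, 2.0]])   # nabla^2 E_D from the example (= A)
H = beta * H_D + 2 * alpha * np.eye(2)     # full Hessian of F
n = 2

gamma_trace = n - 2 * alpha * np.trace(np.linalg.inv(H))
lam = np.linalg.eigvalsh(H_D)              # eigenvalues of nabla^2 E_D: 1 and 3
gamma_eigs = np.sum(beta * lam / (beta * lam + 2 * alpha))
print(gamma_trace, gamma_eigs)             # both ~1.77; each term lies in (0, 1)
```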