Slide05 Haykin Chapter 5: Radial-Basis Function Networks
- Flora Johnson
Transcription
Course: CPSC. Instructor: Yoonsuck Choe. Term: Spring.

Learning in MLP
- Supervised learning in multilayer perceptrons:
  - Recursive technique of stochastic approximation, e.g., backprop.
  - Design of the neural network as a curve-fitting (approximation) problem, e.g., RBF.
- Curve-fitting:
  - Finding a surface in a multidimensional space that provides a best fit to the training data.
  - "Best fit" is measured in a certain statistical sense.
  - RBF is an example: the hidden neurons form an arbitrary basis for the input patterns when they are expanded into the hidden space. These basis functions are called radial-basis functions.

Radial-Basis Function Networks
[Figure: RBF network with an input layer, hidden units $\varphi_1, \ldots, \varphi_4$, and output weights $W$]
- Three layers:
  - Input
  - Hidden: nonlinear transformation from the input space to the hidden space
  - Output: linear activation
- Principal motivation: Cover's theorem. Pattern classification cast in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.

Cover's Theorem
- Cover's theorem on the separability of patterns: a complex pattern-classification problem cast nonlinearly in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.
- Basic idea: nonlinearly map points in the input space to a hidden space that has a higher dimension than the input space.
- Once the proper mapping is done, simple, quick algorithms can be used to find the separating hyperplane.
φ-separability of Patterns
- $N$ input patterns $\mathcal{X} = \{x_1, x_2, \ldots, x_N\}$ in an $m_0$-dimensional space.
- The inputs belong to either of two sets $\mathcal{X}_1$ and $\mathcal{X}_2$: they form a dichotomy.
- The dichotomy is separable with respect to a family of surfaces if a surface exists in the family that separates the points in class $\mathcal{X}_1$ from those in $\mathcal{X}_2$.
- For each $x \in \mathcal{X}$, define an $m_1$-vector $\{\varphi_i(x) \mid i = 1, 2, \ldots, m_1\}$:
  \[ \boldsymbol{\varphi}(x) = [\varphi_1(x), \varphi_2(x), \ldots, \varphi_{m_1}(x)]^T \]
  that maps inputs in the $m_0$-D input space to the $m_1$-D hidden space.
- The $\varphi_i(x)$ are called the hidden functions, and the space spanned by these functions is called the hidden space or feature space.
- A dichotomy is φ-separable if there exists an $m_1$-D vector $w$ such that
  \[ w^T \boldsymbol{\varphi}(x) > 0, \; x \in \mathcal{X}_1; \qquad w^T \boldsymbol{\varphi}(x) < 0, \; x \in \mathcal{X}_2, \]
  with separating hyperplane $w^T \boldsymbol{\varphi}(x) = 0$.

Cover's Theorem Revisited
- Given a set $\mathcal{X}$ of $N$ inputs picked independently from the input space, suppose all the possible dichotomies of $\mathcal{X}$ are equiprobable.
- Let $P(N, m_1)$ denote the probability that a particular dichotomy picked at random is φ-separable, where the family of surfaces has $m_1$ degrees of freedom. In this case,
  \[ P(N, m_1) = \left(\frac{1}{2}\right)^{N-1} \sum_{m=0}^{m_1 - 1} \binom{N-1}{m}, \qquad \text{where} \quad \binom{l}{m} = \frac{l(l-1)(l-2)\cdots(l-m+1)}{m!}. \]

Cover's Theorem: Interpretation
- Separability depends on (1) the particular dichotomy, and (2) the distribution of patterns in the input space.
- The derived $P(N, m_1)$ states that the probability of being φ-separable is equivalent to the cumulative binomial distribution corresponding to the probability that $(N - 1)$ flips of a fair coin will result in $(m_1 - 1)$ or fewer heads.
- In sum, Cover's theorem has two basic ingredients:
  - Nonlinear mapping to the hidden space with $\varphi_i(x)$ $(i = 1, 2, \ldots, m_1)$.
  - High dimensionality of the hidden space compared to the input space ($m_1 > m_0$).
- Corollary: a maximum of $2m_1$ patterns can be linearly separated by a hidden space of $m_1$-D.

Example: XOR (again!)
- With Gaussian hidden functions, the inputs become linearly separable in the hidden space:
  \[ \varphi_1(x) = \exp(-\|x - t_1\|^2), \quad t_1 = [1, 1]^T \]
  \[ \varphi_2(x) = \exp(-\|x - t_2\|^2), \quad t_2 = [0, 0]^T \]
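As a rough numerical check of the Cover's-theorem formula and the XOR example, the sketch below evaluates $P(N, m_1)$ from the cumulative-binomial expression and maps the four XOR inputs through two Gaussian hidden functions. The function names, the centers $t_1 = [1, 1]^T$ and $t_2 = [0, 0]^T$ (the values used in Haykin's textbook XOR example), and the threshold 0.9 are assumptions of this sketch rather than something stated on the slides.

```python
from math import comb, exp

def cover_probability(n_patterns, m1):
    """P(N, m1) = (1/2)**(N-1) * sum_{m=0}^{m1-1} C(N-1, m)."""
    total = sum(comb(n_patterns - 1, m) for m in range(m1))
    return total / 2 ** (n_patterns - 1)

def gaussian_features(x, centers):
    """phi_i(x) = exp(-||x - t_i||^2) for each center t_i."""
    return [exp(-sum((a - b) ** 2 for a, b in zip(x, t))) for t in centers]

centers = [(1.0, 1.0), (0.0, 0.0)]          # assumed centers t_1, t_2
xor_data = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

# In the hidden space, the line phi_1 + phi_2 = 0.9 (an assumed threshold)
# puts the two XOR classes on opposite sides.
hidden_sums = {x: sum(gaussian_features(x, centers)) for x in xor_data}
predicted = {x: int(s < 0.9) for x, s in hidden_sums.items()}
```

With $N = 2m_1$ the formula gives exactly 1/2, which matches the corollary that $2m_1$ is the point where separability stops being likely.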
2D Gaussian
[Figure: surface plot of exp(-(x*x+y*y))]
- Basically, $t_i$ determines the center of the 2D Gaussian $\varphi_i(x) = \exp(-\|x - t_i\|^2)$, and points $x$ that are equidistant from the center have the same $\varphi$ value.

RBF Learning: Overview
[Figure: RBF network with hidden units $\varphi_1, \ldots, \varphi_4$ and output weights $W$]
- The RBF learning problem boils down to two tasks:
  - How to determine the parameters associated with the radial-basis functions $\varphi_i(x)$ in the hidden layer (e.g., the centers of the Gaussians).
  - How to train the hidden-to-output weights. This part is relatively easy.

RBF as an Interpolation Problem
[Figure: RBF network with hidden units $\varphi_1, \ldots, \varphi_4$ and output weights $W$]
- $m_0$-D input to 1-D output mapping $s : \mathbb{R}^{m_0} \to \mathbb{R}^1$.
- The map $s$ can be thought of as a hypersurface $\Gamma \subset \mathbb{R}^{m_0 + 1}$.
- Training: fit the hypersurface $\Gamma$ to the training data points.
- Generalization: interpolate between data points, along the reconstructed surface $\Gamma$.

RBF and Interpolation
- Given $\{x_i \in \mathbb{R}^{m_0} \mid i = 1, 2, \ldots, N\}$ and $N$ labels $\{d_i \in \mathbb{R}^1 \mid i = 1, 2, \ldots, N\}$, find $F : \mathbb{R}^{m_0} \to \mathbb{R}^1$ such that $F(x_i) = d_i$ for all $i$.
- Interpolation is formulated as
  \[ F(x) = \sum_{i=1}^{N} w_i \, \varphi(\|x - x_i\|), \]
  where $\{\varphi(\|x - x_i\|) \mid i = 1, 2, \ldots, N\}$ is a set of $N$ arbitrary (nonlinear) functions known as radial-basis functions.
- The known data points $x_i \in \mathbb{R}^{m_0}$, $i = 1, 2, \ldots, N$, are treated as the centers of the RBFs. (Note that in this case all input data need to be memorized, as in instance-based learning, but this is not a necessary requirement.)
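RBF and Interpolation (cont'd)
\[ F(x) = \sum_{i=1}^{N} w_i \, \varphi(\|x - x_i\|) \]
- So, we have $N$ inputs, $N$ hidden units, and one output unit.
- Expressing everything (all $N$ input-output pairs) in matrix form:
  \[ \begin{bmatrix} \varphi_{11} & \varphi_{12} & \cdots & \varphi_{1N} \\ \varphi_{21} & \varphi_{22} & \cdots & \varphi_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ \varphi_{N1} & \varphi_{N2} & \cdots & \varphi_{NN} \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix} = \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{bmatrix}, \]
  where $\varphi_{ji} = \varphi(\|x_j - x_i\|)$, with $x_j$ the input and $x_i$ the center. We can abbreviate the above as $\Phi w = d$.

RBF and Interpolation (cont'd)
- From $\Phi w = d$, we can find an explicit solution for $w$:
  \[ w = \Phi^{-1} d, \]
  assuming $\Phi$ is nonsingular. (Note: in general, the number of hidden units is much smaller than the number of inputs, so $\Phi$ is not always a square matrix. We will see how to handle this later.)
- Nonsingularity of $\Phi$ is guaranteed by Micchelli's theorem: let $\{x_i\}_{i=1}^{N}$ be a set of distinct points in $\mathbb{R}^{m_0}$. Then the $N$-by-$N$ interpolation matrix $\Phi$, whose $ji$-th element is $\varphi_{ji} = \varphi(\|x_j - x_i\|)$, is nonsingular.

RBF and Interpolation (cont'd)
- When $m_1 < N$ ($m_1$: number of hidden units; $N$: number of inputs), we can find the $w$ that minimizes
  \[ E(w) = \frac{1}{2} \sum_{i=1}^{N} \left( F(x_i) - d_i \right)^2, \qquad \text{where} \quad F(x) = \sum_{k=1}^{m_1} w_k \, \varphi_k(x). \]
- The solution involves the pseudoinverse of $\Phi$:
  \[ w = \underbrace{(\Phi^T \Phi)^{-1} \Phi^T}_{\text{pseudoinverse}} \, d. \]
  Note: $\Phi$ is an $N \times m_1$ rectangular matrix.
- In this case, how to determine the centers of the $\varphi_k(\cdot)$ functions becomes an issue.

Typical RBFs
For some $c > 0$, $\sigma > 0$, and $r \in \mathbb{R}$:
- Multiquadrics (non-local): $\varphi(r) = (r^2 + c^2)^{1/2}$
- Inverse multiquadrics (local): $\varphi(r) = \dfrac{1}{(r^2 + c^2)^{1/2}}$
- Gaussian functions (local): $\varphi(r) = \exp\left(-\dfrac{r^2}{2\sigma^2}\right)$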
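The two linear systems above (exact interpolation with $N$ centers, and least squares with $m_1 < N$ centers) can be sketched with NumPy as follows. The toy data, the Gaussian width `sigma = 0.5`, the choice of the first three inputs as centers, and the helper name `gaussian_design` are all assumptions made for illustration.

```python
import numpy as np

def gaussian_design(X, centers, sigma):
    """Matrix with entries phi_ji = exp(-||x_j - t_i||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(8, 2))     # N = 8 toy inputs in R^2
d = np.sin(X[:, 0]) + X[:, 1] ** 2          # toy targets

# Exact interpolation: every input is a center, so Phi is N x N.
Phi = gaussian_design(X, X, sigma=0.5)
w_exact = np.linalg.solve(Phi, d)           # w = Phi^{-1} d

# Fewer hidden units than inputs: Phi is N x m1, solved in the
# least-squares sense (equivalent to applying the pseudoinverse).
centers = X[:3]                             # assumed choice of m1 = 3 centers
Phi_r = gaussian_design(X, centers, sigma=0.5)
w_ls, *_ = np.linalg.lstsq(Phi_r, d, rcond=None)
```

Solving the system directly, rather than forming $\Phi^{-1}$ or $(\Phi^T\Phi)^{-1}$ explicitly, is the usual numerically safer choice.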
Other Perspectives on RBF Learning
- RBF learning is formulated as
  \[ F(x) = \sum_{k=1}^{m_1} w_k \, \varphi_k(x). \]
- This kind of expression was given without much rationale, other than intuitive appeal.
- However, there is a way to derive the above formalism from an interesting theoretical point of view, which we will see next.

Supervised Learning as Ill-Posed Hypersurface Reconstruction Problem
- The exact interpolation approach has limitations:
  - Poor generalization: data points being more numerous than the degrees of freedom of the underlying process can lead to overfitting.
- How to overcome this issue?
  - Approach the problem from the perspective that learning is a hypersurface reconstruction problem given a sparse set of data points.
  - Contrast between the direct problem (in many cases well-posed) and the inverse problem (in many cases ill-posed).

Well-Posed Problems in Reconstructing Functional Mapping
Given an unknown mapping from domain $X$ to range $Y$, we want to reconstruct the mapping $f$. The problem is well-posed if all of the following conditions are satisfied:
- Existence: for all $x \in X$, there exists an output $y \in Y$ such that $y = f(x)$.
- Uniqueness: for all $x, t \in X$, $f(x) = f(t)$ iff $x = t$.
- Continuity: the mapping is continuous. For any $\epsilon > 0$ there exists $\delta = \delta(\epsilon)$ such that $\rho_x(x, t) < \delta \Rightarrow \rho_y(f(x), f(t)) < \epsilon$, where $\rho(\cdot, \cdot)$ is the distance measure.
If any of these conditions is violated, the problem is called an ill-posed problem.

Ill-Posed Problems and Solutions
- Direct (causal) mappings are generally well-posed (e.g., 3D object to 2D projection).
- Inverse problems, on the other hand, are ill-posed (e.g., reconstructing 3D structure from 2D projections).
- For ill-posed problems, solutions are not unique (in many cases there can be infinitely many): we need prior knowledge (or some kind of preference) to narrow down the range of solutions. This is called regularization.
- Treating supervised learning as an ill-posed problem, and using certain prior knowledge, we can derive the RBF formalism.
Regularization Theory: Overview
- The main idea behind regularization is to stabilize the solution by means of prior information.
- This is done by including a functional (a function that maps a function to a scalar) in the cost function, so that the functional is minimized as well.
  - Only a small number of candidate solutions will minimize this functional.
- These functional terms are called regularization terms.
- Typically, the functionals measure the smoothness of the function.

Tikhonov's Regularization Theory
- Task: given inputs $x_i \in \mathbb{R}^{m_0}$ and targets $d_i \in \mathbb{R}^1$, find $F(x)$.
- Minimize the sum of two terms:
  - Standard error term:
    \[ E_s(F) = \frac{1}{2} \sum_{i=1}^{N} (d_i - y_i)^2 = \frac{1}{2} \sum_{i=1}^{N} \left( d_i - F(x_i) \right)^2 \]
  - Regularization term:
    \[ E_c(F) = \frac{1}{2} \|DF\|^2, \]
    where $D$ is a linear differential operator and $\|\cdot\|$ is the norm of the function space.
- Putting these together, we want to minimize (with regularization parameter $\lambda$)
  \[ E(F) = E_s(F) + \lambda E_c(F) = \frac{1}{2} \sum_{i=1}^{N} \left( d_i - F(x_i) \right)^2 + \frac{1}{2} \lambda \|DF\|^2. \]

Error Term vs Regularization Term
[Figure: three panels (left, middle, right), each showing the data, the fit, and the RBFs]

Panel  | Error    | Regularization
Left   | Bad fit  | Extremely smooth
Middle | Good fit | Smooth
Right  | Over fit | Jagged

Try this demo:

Solution that Minimizes E(F)
- Problem: minimize
  \[ E(F) = E_s(F) + \lambda E_c(F) = \frac{1}{2} \sum_{i=1}^{N} \left( d_i - F(x_i) \right)^2 + \frac{1}{2} \lambda \|DF\|^2. \]
- Solution: the $F_\lambda(x)$ that satisfies the Euler-Lagrange equation (below) minimizes $E(F)$:
  \[ \tilde{D} D F_\lambda(x) - \frac{1}{\lambda} \sum_{i=1}^{N} \left[ d_i - F(x_i) \right] \delta(x - x_i) = 0, \]
  where $\tilde{D}$ is the adjoint operator of $D$ and $\delta(\cdot)$ is the Dirac delta function.
Solution that Minimizes E(F) (cont'd)
- The solution to the Euler-Lagrange equation can be formulated in terms of the Green's function that satisfies
  \[ \tilde{D} D \, G(x, x') = \delta(x - x'). \]
  Note: the form of $G(\cdot, \cdot)$ depends on the particular choice of $D$.
- Finally, the desired function $F_\lambda(x)$ that minimizes $E(F)$ is
  \[ F_\lambda(x) = \frac{1}{\lambda} \sum_{i=1}^{N} \left[ d_i - F(x_i) \right] G(x, x_i). \]

Solution that Minimizes E(F) (cont'd)
- Letting
  \[ w_i = \frac{1}{\lambda} \left[ d_i - F(x_i) \right], \quad i = 1, 2, \ldots, N, \]
  we can recast $F_\lambda(x)$ as
  \[ F_\lambda(x) = \sum_{i=1}^{N} w_i \, G(x, x_i). \]
- Plugging in input $x_j$, we get
  \[ F_\lambda(x_j) = \sum_{i=1}^{N} w_i \, G(x_j, x_i). \]
- Note the similarity to the RBF:
  \[ F(x) = \sum_{i=1}^{N} w_i \, \varphi(\|x - x_i\|). \]

Solution that Minimizes E(F) (cont'd)
- We can use matrix notation:
  \[ F_\lambda = [F_\lambda(x_1), F_\lambda(x_2), \ldots, F_\lambda(x_N)]^T \]
  \[ d = [d_1, d_2, \ldots, d_N]^T \]
  \[ G = \begin{bmatrix} G(x_1, x_1) & G(x_1, x_2) & \cdots & G(x_1, x_N) \\ G(x_2, x_1) & G(x_2, x_2) & \cdots & G(x_2, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ G(x_N, x_1) & G(x_N, x_2) & \cdots & G(x_N, x_N) \end{bmatrix} \]
  \[ w = [w_1, w_2, \ldots, w_N]^T \]
- Then we can rewrite the formulas on the previous page as
  \[ w = \frac{1}{\lambda} (d - F_\lambda), \qquad F_\lambda = G w. \]

Solution that Minimizes E(F) (cont'd)
- Combining
  \[ w = \frac{1}{\lambda} (d - F_\lambda), \qquad F_\lambda = G w, \]
  we can eliminate $F_\lambda$ to get
  \[ (G + \lambda I) w = d. \]
- From this, we can get the weights:
  \[ w = (G + \lambda I)^{-1} d, \]
  if $G + \lambda I$ is invertible (it needs to be positive definite, which can be ensured by a large $\lambda$).
- Note: $G(x_i, x_j) = G(x_j, x_i)$, thus $G^T = G$.
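The weight formula $w = (G + \lambda I)^{-1} d$ can be tried out on toy data. The sketch below assumes a Gaussian Green's function with width `sigma = 0.3`, synthetic 1-D data, and two arbitrary values of $\lambda$; none of these come from the slides.

```python
import numpy as np

def gaussian_green_matrix(X, sigma):
    """Symmetric N x N matrix with entries G(x_j, x_i) for a Gaussian G."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def regularized_weights(G, d, lam):
    """Solve (G + lam * I) w = d for w."""
    return np.linalg.solve(G + lam * np.eye(len(d)), d)

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(10, 1))                  # toy 1-D inputs
d = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(10)  # noisy targets
G = gaussian_green_matrix(X, sigma=0.3)

w_small = regularized_weights(G, d, lam=0.01)   # weak regularization
w_large = regularized_weights(G, d, lam=1.0)    # strong regularization

# Training residual ||d - G w||: a larger lambda trades data fit for smoothness.
res_small = np.linalg.norm(d - G @ w_small)
res_large = np.linalg.norm(d - G @ w_large)
```

Because $d - Gw = \lambda w$ at the solution, the residual is never exactly zero for $\lambda > 0$, which is the intended departure from strict interpolation.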
Solution that Minimizes E(F) (cont'd)
Steps to follow:
1. Determine $D$.
2. Find the Green's function matrix $G$ associated with $D$.
3. Obtain the weights by
   \[ w = (G + \lambda I)^{-1} d. \]
- For certain $D$s, the corresponding Green's functions are Gaussian, inverse multiquadrics, etc. Thus, the whole approach is similar to RBF. (More on this later.)

Regularization Theory and RBF
- In conclusion, the solution to the regularization problem is given by the expansion
  \[ F_\lambda(x) = \sum_{i=1}^{N} w_i \, G(x, x_i), \]
  where $G(\cdot, \cdot)$ is the Green's function for the self-adjoint operator $\tilde{D} D$.
- When the stabilizer $D$ has certain properties, the resulting Green's functions $G(\cdot, \cdot)$ become radial-basis functions:
  - $D$ is translationally invariant: $G(x, x_i) = G(x - x_i)$.
  - $D$ is translationally and rotationally invariant: $G(x, x_i) = G(\|x - x_i\|)$, which is an RBF!

Translationally and Rotationally Invariant D
- One example of a Green's function that corresponds to a translationally and rotationally invariant $D$ is the multivariate Gaussian function:
  \[ G(x, x_i) = \exp\left( -\frac{1}{2\sigma_i^2} \|x - x_i\|^2 \right). \]
- With this, the regularized solution becomes
  \[ F_\lambda(x) = \sum_{i=1}^{N} w_i \exp\left( -\frac{1}{2\sigma_i^2} \|x - x_i\|^2 \right). \]
- This function is known to be a universal approximator.

Regularization Networks
[Figure: regularization network with inputs $x_1, \ldots, x_{m_0}$, hidden units $G$, weights $w_1, \ldots, w_N$, and output $F(x)$]
- The regularized solution
  \[ F_\lambda(x) = \sum_{i=1}^{N} w_i \exp\left( -\frac{1}{2\sigma_i^2} \|x - x_i\|^2 \right) \]
  can be represented as a network.
- Hidden unit $i$ computes $G(x, x_i)$ (one hidden unit per input pattern).
- The output unit is a linear weighted sum of the hidden unit activations.
- Problem: we need $N$ hidden units for $N$ input patterns, which can be excessive for large input sets.
Desirable Properties of Regularization Networks
- The regularization network is a universal approximator: it can approximate any multivariate continuous function arbitrarily well, given a sufficiently large number of hidden units.
- The network has the best-approximation property (the best coefficients will be found).
- The solution is optimal: it minimizes the cost functional $E(F)$.

Generalized Radial-Basis Function Networks
[Figure: generalized RBF network with inputs $x_1, \ldots, x_{m_0}$, hidden units $\varphi_1, \ldots, \varphi_{m_1}$, bias weight $w_0 = b$, and weights $w_1, \ldots, w_{m_1}$]
- Regularization networks can be computationally demanding when $N$ is huge.
- Generalized RBF overcomes this problem:
  \[ F^*(x) = \sum_{i=1}^{m_1} w_i \, \varphi_i(x), \]
  where $m_1 < N$ and $\varphi_i(x) = G(\|x - t_i\|)$, so that
  \[ F^*(x) = \sum_{i=1}^{m_1} w_i \, G(\|x - t_i\|). \]

Generalized Radial-Basis Function Networks (cont'd)
- To find the weights $w_i$, we minimize
  \[ E(F^*) = \sum_{i=1}^{N} \left( d_i - \sum_{j=1}^{m_1} w_j \, G(\|x_i - t_j\|) \right)^2 + \lambda \|DF^*\|^2. \]
- Minimization of $E(F^*)$ yields
  \[ (G^T G + \lambda G_0) w = G^T d, \]
  where $G$ is the $N \times m_1$ matrix with entries $G(x_j, t_i)$ and
  \[ G_0 = \begin{bmatrix} G(t_1, t_1) & G(t_1, t_2) & \cdots & G(t_1, t_{m_1}) \\ G(t_2, t_1) & G(t_2, t_2) & \cdots & G(t_2, t_{m_1}) \\ \vdots & \vdots & \ddots & \vdots \\ G(t_{m_1}, t_1) & G(t_{m_1}, t_2) & \cdots & G(t_{m_1}, t_{m_1}) \end{bmatrix}. \]
- As $\lambda$ approaches 0, we get $G^T G w = G^T d$, so
  \[ w = (G^T G)^{-1} G^T d = G^+ d. \]

Weighted Norm
- When individual input lines in the input vector $x$ belong to different classes, a weighted norm can be used:
  \[ \|x\|_C^2 = (Cx)^T (Cx) = x^T C^T C x. \]
- The approximation function can be rewritten as
  \[ F^*(x) = \sum_{i=1}^{m_1} w_i \, G(\|x - t_i\|_C). \]
- For the Gaussian RBF, it can be interpreted as
  \[ G(\|x - t_i\|_C) = \exp\left( -(x - t_i)^T C^T C (x - t_i) \right) = \exp\left( -\frac{1}{2} (x - t_i)^T \Sigma^{-1} (x - t_i) \right), \]
  where $t_i$ represents the mean vector and $\Sigma$ the covariance matrix of a multivariate Gaussian function.
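A minimal sketch of the generalized-RBF weight equation $(G^T G + \lambda G_0) w = G^T d$ follows, assuming a Gaussian Green's function, four hand-picked centers, and toy 1-D data. The helper names `green` and `grbf_weights` are hypothetical. Setting $\lambda = 0$ should reproduce the pseudoinverse solution $w = G^+ d$.

```python
import numpy as np

def green(A, B, sigma):
    """Matrix of Gaussian Green's function values G(a_j, b_i)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def grbf_weights(X, d, centers, sigma, lam):
    """Solve (G^T G + lam * G0) w = G^T d for the m1 output weights."""
    G = green(X, centers, sigma)          # N x m1, entries G(x_j, t_i)
    G0 = green(centers, centers, sigma)   # m1 x m1, entries G(t_j, t_i)
    return np.linalg.solve(G.T @ G + lam * G0, G.T @ d)

X = np.linspace(0.0, 1.0, 20)[:, None]              # N = 20 toy inputs
d = np.sin(2.0 * np.pi * X[:, 0])                   # toy targets
centers = np.array([[0.1], [0.4], [0.6], [0.9]])    # assumed m1 = 4 centers

w_reg = grbf_weights(X, d, centers, sigma=0.2, lam=0.1)
w_ls = grbf_weights(X, d, centers, sigma=0.2, lam=0.0)   # lambda -> 0 limit

G = green(X, centers, 0.2)
res_reg = np.linalg.norm(d - G @ w_reg)
res_ls = np.linalg.norm(d - G @ w_ls)
```

Only an $m_1 \times m_1$ system is solved here, compared with the $N \times N$ system of the full regularization network.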
Regularization Networks vs Generalized RBF
[Figure: regularization network (hidden units $G$, weights $w_1, \ldots, w_N$, output $F(x)$) next to a generalized RBF network (hidden units $\varphi$, bias weight $w_0 = b$, weights $w_1, \ldots, w_{m_1}$)]
- The hidden layer in GRBF is much smaller: $m_1 < N$.
- In GRBF, (1) the weights, (2) the RBF centers $t_i$, and (3) the norm weighting matrix are all unknown parameters to be determined.
- In regularization networks, the RBF centers are known (they are the same as the inputs), and only the weights need to be determined.

Estimating the Parameters
- Weights $w_i$: already discussed (more next).
- Regularization parameter $\lambda$:
  - Minimize the averaged squared error.
  - Use generalized cross-validation.
- RBF centers:
  - Randomly select fixed centers.
  - Self-organized selection.
  - Supervised selection.

Estimating the Regularization Parameter λ
- Minimize the average squared error:
  - For a fixed $\lambda$, for all $N$ inputs, calculate the squared error between the true function value and the estimated RBF network output using that $\lambda$.
  - Find the optimal $\lambda$ that minimizes this error.
  - Problem: this requires knowledge of the true function values.
- Generalized cross-validation:
  - Use leave-one-out cross-validation.
  - With a fixed $\lambda$, for all $N$ inputs, find the difference between the target value (from the training set) and the value predicted by the network trained with that input left out.
  - This approach depends only on the training set.

RBF Learning
[Figure: generalized RBF network with hidden units $\varphi$, bias weight $w_0 = b$, and weights $w_1, \ldots, w_{m_1}$]
- The basic idea is to learn on two different time scales:
  - Nonlinear, slow learning of the RBF parameters (center, variance).
  - Linear, fast learning of the hidden-to-output weights.
RBF Learning (1/3): Random Centers
- Use $m_1$ hidden units:
  \[ G(\|x - t_i\|^2) = \exp\left( -\frac{m_1}{d_{\max}^2} \|x - t_i\|^2 \right), \]
  where the $t_i$ $(i = 1, 2, \ldots, m_1)$ are picked at random from the available inputs $x_j$ $(j = 1, 2, \ldots, N)$.
- Note that the standard deviation (width) of the RBF is fixed to
  \[ \sigma = \frac{d_{\max}}{\sqrt{2 m_1}}, \]
  where $d_{\max}$ is the maximum distance between the chosen centers $t_i$. This gives a width that is neither too peaked nor too flat.
- The linear weights are learned using the pseudoinverse:
  \[ w = G^+ d = (G^T G)^{-1} G^T d, \]
  where the matrix $G = \{g_{ji}\}$, $g_{ji} = G(\|x_j - t_i\|^2)$.

Finding G+ with Singular Value Decomposition
- If, for a real $N \times M$ matrix $G$, there exist orthogonal matrices
  \[ U = [u_1, u_2, \ldots, u_N], \qquad V = [v_1, v_2, \ldots, v_M], \]
  such that
  \[ U^T G V = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_K) = \Sigma, \qquad K = \min(M, N), \]
  then $U$ is called the left singular matrix, $V$ the right singular matrix, and $\sigma_1, \sigma_2, \ldots, \sigma_K$ the singular values of the matrix $G$.
- Once these are known, we can obtain $G^+$ as
  \[ G^+ = V \Sigma^+ U^T, \qquad \text{where} \quad \Sigma^+ = \mathrm{diag}\left( \frac{1}{\sigma_1}, \frac{1}{\sigma_2}, \ldots, \frac{1}{\sigma_K} \right). \]
- There are efficient algorithms for singular value decomposition that can be used for this.

Finding G+ with Singular Value Decomposition (cont'd)
- Using these properties, we can verify that $G G^+ = I$. (The chain below relies on $\Sigma \Sigma^+ = I$, which holds when $G$ is square and nonsingular; for a tall $G$ with full column rank, the analogous identity is $G^+ G = I$.)
  \[ U^{-1} = U^T, \qquad V^{-1} = V^T, \qquad \Sigma \Sigma^+ = I \]
  \[ U^T G V = \Sigma \]
  \[ U U^T G V V^T = U \Sigma V^T \]
  \[ G = U \Sigma V^T \]
  \[ G G^+ = U \Sigma V^T V \Sigma^+ U^T = U \Sigma \Sigma^+ U^T = U U^T = I \]

RBF Learning (2/3): Self-Organized Centers
- The random-center approach is only effective with large input sets.
- To overcome this, we can take a hybrid approach: (1) self-organized learning of the centers, and (2) supervised learning of the linear weights.
- Clustering for RBF center learning (similar to Self-Organizing Maps):
  1. Initialization: randomly choose distinct $t_k(0)$s.
  2. Sampling: draw a random input vector $x \in \mathcal{X}$.
  3. Similarity matching: find the best-matching center vector $t_{k(x)}$:
     \[ k(x) = \arg\min_k \|x(n) - t_k(n)\| \]
  4. Updating: update the center vectors:
     \[ t_k(n+1) = \begin{cases} t_k(n) + \eta \, [x(n) - t_k(n)], & \text{if } k = k(x) \\ t_k(n), & \text{otherwise} \end{cases} \]
  5. Continuation: increment $n$ and repeat from step 2.
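The pieces on this page can be sketched together: the fixed width $\sigma = d_{\max} / \sqrt{2 m_1}$, the pseudoinverse built from an SVD, and the five-step self-organized center update. The function names, the learning rate `eta = 0.1`, the step count, and the two-cluster toy data are assumptions of this sketch.

```python
import numpy as np

def fixed_width(centers):
    """sigma = d_max / sqrt(2 * m1), with d_max the largest center distance."""
    d_max = max(np.linalg.norm(a - b) for a in centers for b in centers)
    return d_max / np.sqrt(2.0 * len(centers))

def pinv_via_svd(G):
    """Pseudoinverse G+ = V Sigma+ U^T computed from the thin SVD of G."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    tol = max(G.shape) * np.finfo(float).eps * s.max()
    # Invert only singular values safely above zero; the rest map to 0.
    s_inv = np.where(s > tol, 1.0 / np.where(s > tol, s, 1.0), 0.0)
    return Vt.T @ (s_inv[:, None] * U.T)

def select_centers(X, m1, eta=0.1, n_steps=2000, seed=0):
    """Self-organized center selection following steps 1 to 5."""
    rng = np.random.default_rng(seed)
    # Step 1: initialise with m1 distinct training points.
    centers = X[rng.choice(len(X), size=m1, replace=False)].astype(float)
    for _ in range(n_steps):                     # Step 5: keep iterating
        x = X[rng.integers(len(X))]              # Step 2: sample an input
        k = np.argmin(np.linalg.norm(centers - x, axis=1))   # Step 3
        centers[k] += eta * (x - centers[k])     # Step 4: move the winner only
    return centers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),    # cluster near (0, 0)
               rng.normal(5.0, 0.1, size=(50, 2))])   # cluster near (5, 5)
centers = select_centers(X, m1=2)
sigma = fixed_width(centers)
```

In practice `np.linalg.pinv` or `np.linalg.lstsq` would normally be used instead of a hand-written pseudoinverse; the explicit version is shown only to mirror $G^+ = V \Sigma^+ U^T$.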
RBF Learning (3/3): Supervised Selection of Centers
- Use error-correction learning to adjust all RBF parameters to minimize the error cost function:
  \[ E = \frac{1}{2} \sum_{j=1}^{N} e_j^2, \qquad e_j = d_j - F^*(x_j) = d_j - \sum_{i=1}^{M} w_i \, G(\|x_j - t_i\|_{C_i}). \]
- Linear weights (output layer):
  \[ w_i(n+1) = w_i(n) - \eta_1 \frac{\partial E(n)}{\partial w_i(n)} \]
- Position of centers (hidden layer):
  \[ t_i(n+1) = t_i(n) - \eta_2 \frac{\partial E(n)}{\partial t_i(n)} \]
- Spread of centers (hidden layer):
  \[ \Sigma_i^{-1}(n+1) = \Sigma_i^{-1}(n) - \eta_3 \frac{\partial E(n)}{\partial \Sigma_i^{-1}(n)} \]

RBF Learning (3/3): Supervised Selection of Centers (cont'd)
Here $G'$ denotes the derivative of $G$ with respect to its argument.
- Linear weights:
  \[ \frac{\partial E(n)}{\partial w_i(n)} = -\sum_{j=1}^{N} e_j(n) \, G(\|x_j - t_i(n)\|_{C_i}) \]
- Position of centers:
  \[ \frac{\partial E(n)}{\partial t_i(n)} = 2 w_i(n) \sum_{j=1}^{N} e_j(n) \, G'(\|x_j - t_i(n)\|_{C_i}) \, \Sigma_i^{-1} [x_j - t_i(n)] \]
- Spread of centers:
  \[ \frac{\partial E(n)}{\partial \Sigma_i^{-1}(n)} = -w_i(n) \sum_{j=1}^{N} e_j(n) \, G'(\|x_j - t_i(n)\|_{C_i}) \, Q_{ji}(n), \qquad Q_{ji}(n) = [x_j - t_i(n)][x_j - t_i(n)]^T \]

Comparison of RBF and MLP
- RBF has a single hidden layer, while MLP can have many.
- In MLP, hidden and output neurons have the same underlying function. In RBF, they are specialized into distinct functions.
- In RBF, the output layer is linear, but in MLP, all neurons are nonlinear.
- The hidden neurons in RBF calculate the Euclidean distance between the input vector and the center, while in MLP the inner product of the input vector and the weight vector is calculated.
- MLPs construct global approximations to nonlinear input-output mappings. RBF uses exponentially decaying localized nonlinearities (e.g., Gaussians) to construct local approximations to nonlinear input-output mappings.

Summary
- The RBF network is unusual due to its two different unit types: the RBF hidden layer and the linear output layer.
- RBF is derived in a principled manner, starting from Tikhonov's regularization theory, unlike MLP.
- In RBF, the smoothing term becomes important, with different $D$ giving rise to different Green's functions $G(\cdot, \cdot)$.
- Generalized RBF lifts the requirement of $N$ hidden units for $N$ input patterns, greatly reducing the computational complexity.
- Proper estimation of the regularization parameter $\lambda$ is needed.
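The supervised updates for the weights and centers can be sketched as plain gradient descent. This is a simplified illustration, not the slides' full procedure: it assumes a Gaussian $G(u) = \exp(-u)$ with $u = s \|x_j - t_i\|^2$, i.e. a fixed, shared $\Sigma_i^{-1} = s I$ that is not adapted, and the learning rate 0.005, the toy data, and all function names are made up for this example.

```python
import numpy as np

def rbf_outputs(X, w, T, s):
    """Return hidden activations G (N x m), differences x_j - t_i, and F(x_j)."""
    diff = X[:, None, :] - T[None, :, :]
    G = np.exp(-s * (diff ** 2).sum(axis=2))
    return G, diff, G @ w

def cost(X, d, w, T, s):
    """E = 1/2 * sum_j e_j^2 with e_j = d_j - F(x_j)."""
    _, _, F = rbf_outputs(X, w, T, s)
    return 0.5 * np.sum((d - F) ** 2)

def gradients(X, d, w, T, s):
    """Gradients of E with respect to the weights w and the centers T."""
    G, diff, F = rbf_outputs(X, w, T, s)
    e = d - F
    grad_w = -G.T @ e                      # dE/dw_i = -sum_j e_j G_ji
    # dE/dt_i = 2 w_i sum_j e_j G'(u_ji) s (x_j - t_i), with G'(u) = -G(u)
    grad_T = -2.0 * s * w[:, None] * np.einsum("j,ji,jik->ik", e, G, diff)
    return grad_w, grad_T

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 1))   # toy 1-D inputs
d = np.sin(3.0 * X[:, 0])                  # toy targets
w = np.full(3, 0.1)                        # initial weights
T = np.array([[-0.5], [0.0], [0.5]])       # initial centers
s = 2.0                                    # fixed inverse-spread scale

E_start = cost(X, d, w, T, s)
for _ in range(200):
    g_w, g_T = gradients(X, d, w, T, s)
    w = w - 0.005 * g_w                    # eta_1 (assumed value)
    T = T - 0.005 * g_T                    # eta_2 (assumed value)
E_end = cost(X, d, w, T, s)
```

Adapting $\Sigma_i^{-1}$ as well would add a third update of the same shape, using the outer products $Q_{ji}$.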
More information9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee
9.520: Class 20 Bayesian Interpretations Tomaso Poggio and Sayan Mukherjee Plan Bayesian interpretation of Regularization Bayesian interpretation of the regularizer Bayesian interpretation of quadratic
More informationNN V: The generalized delta learning rule
NN V: The generalized delta learning rule We now focus on generalizing the delta learning rule for feedforward layered neural networks. The architecture of the two-layer network considered below is shown
More informationCS168: The Modern Algorithmic Toolbox Lecture #6: Regularization
CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization Tim Roughgarden & Gregory Valiant April 18, 2018 1 The Context and Intuition behind Regularization Given a dataset, and some class of models
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationCPSC 340: Machine Learning and Data Mining. More PCA Fall 2017
CPSC 340: Machine Learning and Data Mining More PCA Fall 2017 Admin Assignment 4: Due Friday of next week. No class Monday due to holiday. There will be tutorials next week on MAP/PCA (except Monday).
More informationSupport'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan
Support'Vector'Machines Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan kasthuri.kannan@nyumc.org Overview Support Vector Machines for Classification Linear Discrimination Nonlinear Discrimination
More informationPreprocessing & dimensionality reduction
Introduction to Data Mining Preprocessing & dimensionality reduction CPSC/AMTH 445a/545a Guy Wolf guy.wolf@yale.edu Yale University Fall 2016 CPSC 445 (Guy Wolf) Dimensionality reduction Yale - Fall 2016
More informationMIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design
MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 19: Data Representation by Design What is data representation? Let X be a data-space X M (M) F (M) X A data representation
More informationEngineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers
Engineering Part IIB: Module 4F0 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers Phil Woodland: pcw@eng.cam.ac.uk Michaelmas 202 Engineering Part IIB:
More informationA Least Squares Formulation for Canonical Correlation Analysis
A Least Squares Formulation for Canonical Correlation Analysis Liang Sun, Shuiwang Ji, and Jieping Ye Department of Computer Science and Engineering Arizona State University Motivation Canonical Correlation
More informationA Logarithmic Neural Network Architecture for Unbounded Non-Linear Function Approximation
1 Introduction A Logarithmic Neural Network Architecture for Unbounded Non-Linear Function Approximation J Wesley Hines Nuclear Engineering Department The University of Tennessee Knoxville, Tennessee,
More informationCS 7140: Advanced Machine Learning
Instructor CS 714: Advanced Machine Learning Lecture 3: Gaussian Processes (17 Jan, 218) Jan-Willem van de Meent (j.vandemeent@northeastern.edu) Scribes Mo Han (han.m@husky.neu.edu) Guillem Reus Muns (reusmuns.g@husky.neu.edu)
More informationL20: MLPs, RBFs and SPR Bayes discriminants and MLPs The role of MLP hidden units Bayes discriminants and RBFs Comparison between MLPs and RBFs
L0: MLPs, RBFs and SPR Bayes discriminants and MLPs The role of MLP hidden units Bayes discriminants and RBFs Comparison between MLPs and RBFs CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Teemu Roos TAs: Ville Hyvönen and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer
More informationMachine Learning - MT & 5. Basis Expansion, Regularization, Validation
Machine Learning - MT 2016 4 & 5. Basis Expansion, Regularization, Validation Varun Kanade University of Oxford October 19 & 24, 2016 Outline Basis function expansion to capture non-linear relationships
More informationBrief Introduction of Machine Learning Techniques for Content Analysis
1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview
More informationNeural Networks and the Back-propagation Algorithm
Neural Networks and the Back-propagation Algorithm Francisco S. Melo In these notes, we provide a brief overview of the main concepts concerning neural networks and the back-propagation algorithm. We closely
More informationAnnouncements. Proposals graded
Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:
More informationSVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning
SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are
More informationLecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.
Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationReproducing Kernel Hilbert Spaces
9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we
More informationCS798: Selected topics in Machine Learning
CS798: Selected topics in Machine Learning Support Vector Machine Jakramate Bootkrajang Department of Computer Science Chiang Mai University Jakramate Bootkrajang CS798: Selected topics in Machine Learning
More informationRecap from previous lecture
Recap from previous lecture Learning is using past experience to improve future performance. Different types of learning: supervised unsupervised reinforcement active online... For a machine, experience
More informationRegularized Least Squares
Regularized Least Squares Ryan M. Rifkin Google, Inc. 2008 Basics: Data Data points S = {(X 1, Y 1 ),...,(X n, Y n )}. We let X simultaneously refer to the set {X 1,...,X n } and to the n by d matrix whose
More informationData Mining Part 5. Prediction
Data Mining Part 5. Prediction 5.5. Spring 2010 Instructor: Dr. Masoud Yaghini Outline How the Brain Works Artificial Neural Networks Simple Computing Elements Feed-Forward Networks Perceptrons (Single-layer,
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and
More informationComputer Vision Group Prof. Daniel Cremers. 6. Mixture Models and Expectation-Maximization
Prof. Daniel Cremers 6. Mixture Models and Expectation-Maximization Motivation Often the introduction of latent (unobserved) random variables into a model can help to express complex (marginal) distributions
More informationUnsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto
Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto About this class Goal To introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian
More informationIntroduction to Hilbert Space Frames
to Hilbert Space Frames May 15, 2009 to Hilbert Space Frames What is a frame? Motivation Coefficient Representations The Frame Condition Bases A linearly dependent frame An infinite dimensional frame Reconstructing
More informationLecture: Face Recognition and Feature Reduction
Lecture: Face Recognition and Feature Reduction Juan Carlos Niebles and Ranjay Krishna Stanford Vision and Learning Lab 1 Recap - Curse of dimensionality Assume 5000 points uniformly distributed in the
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationRKHS, Mercer s theorem, Unbounded domains, Frames and Wavelets Class 22, 2004 Tomaso Poggio and Sayan Mukherjee
RKHS, Mercer s theorem, Unbounded domains, Frames and Wavelets 9.520 Class 22, 2004 Tomaso Poggio and Sayan Mukherjee About this class Goal To introduce an alternate perspective of RKHS via integral operators
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More informationKernel Methods. Barnabás Póczos
Kernel Methods Barnabás Póczos Outline Quick Introduction Feature space Perceptron in the feature space Kernels Mercer s theorem Finite domain Arbitrary domain Kernel families Constructing new kernels
More informationCS281 Section 4: Factor Analysis and PCA
CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we
More informationLecture 4. 1 Learning Non-Linear Classifiers. 2 The Kernel Trick. CS-621 Theory Gems September 27, 2012
CS-62 Theory Gems September 27, 22 Lecture 4 Lecturer: Aleksander Mądry Scribes: Alhussein Fawzi Learning Non-Linear Classifiers In the previous lectures, we have focused on finding linear classifiers,
More informationSupport Vector Machines. CSE 4309 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Support Vector Machines CSE 4309 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 A Linearly Separable Problem Consider the binary classification
More informationLinear Models for Classification
Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,
More informationPrincipal Component Analysis (PCA) CSC411/2515 Tutorial
Principal Component Analysis (PCA) CSC411/2515 Tutorial Harris Chan Based on previous tutorial slides by Wenjie Luo, Ladislav Rampasek University of Toronto hchan@cs.toronto.edu October 19th, 2017 (UofT)
More informationLogistic Regression: Online, Lazy, Kernelized, Sequential, etc.
Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010
More informationSince the determinant of a diagonal matrix is the product of its diagonal elements it is trivial to see that det(a) = α 2. = max. A 1 x.
APPM 4720/5720 Problem Set 2 Solutions This assignment is due at the start of class on Wednesday, February 9th. Minimal credit will be given for incomplete solutions or solutions that do not provide details
More informationLecture 9: Large Margin Classifiers. Linear Support Vector Machines
Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Perceptrons Definition Perceptron learning rule Convergence Margin & max margin classifiers (Linear) support vector machines Formulation
More informationKernels A Machine Learning Overview
Kernels A Machine Learning Overview S.V.N. Vishy Vishwanathan vishy@axiom.anu.edu.au National ICT of Australia and Australian National University Thanks to Alex Smola, Stéphane Canu, Mike Jordan and Peter
More informationLinear Regression (continued)
Linear Regression (continued) Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39 Outline 1 Administration 2 Review of last lecture 3 Linear regression
More informationNeural networks and support vector machines
Neural netorks and support vector machines Perceptron Input x 1 Weights 1 x 2 x 3... x D 2 3 D Output: sgn( x + b) Can incorporate bias as component of the eight vector by alays including a feature ith
More information