Plan of Class 4. Radial Basis Functions with moving centers. Projection Pursuit Regression and ridge function approximation. Principal Component Analysis: basic ideas
1 Plan of Class 4: Radial Basis Functions with moving centers; Multilayer Perceptrons; Projection Pursuit Regression and ridge function approximation; Principal Component Analysis: basic ideas; Radial Basis Functions and dimensionality reduction; from additive splines to ridge function approximation.
2 Regularization approach. A smooth function that approximates the data set D = {(x_i, y_i) ∈ R^d × R}_{i=1}^N can be found by minimizing the functional

H[f] = Σ_{i=1}^N (y_i − f(x_i))² + λ ∫_{R^d} ds |f̃(s)|² / G̃(s)

where G̃ is a positive, even function decreasing to zero at infinity, and λ is a small positive number.
3 Main result (Duchon, 1977; Meinguet, 1979; Wahba, 1977; Madych and Nelson, 1990; Poggio and Girosi, 1989; Girosi, 1992). The function that minimizes the functional

H[f] = Σ_{i=1}^N (y_i − f(x_i))² + λ ∫_{R^d} ds |f̃(s)|² / G̃(s)

has the form

f(x) = Σ_{i=1}^N c_i G(x − x_i) + Σ_{α=1}^k d_α ψ_α(x)

where G is conditionally positive definite of order m, G̃ is the Fourier transform of G, and {ψ_α}_{α=1}^k is a basis in the space of polynomials of degree m − 1.
4 Computation of the coefficients. The coefficients are found by solving the linear system

(G + λI)c + Ψᵀd = y
Ψc = 0

where I is the identity matrix and we have defined

(y)_i = y_i, (c)_i = c_i, (d)_α = d_α, (G)_{ij} = G(x_i − x_j), (Ψ)_{αi} = ψ_α(x_i)
5 Radial Basis Functions. If the function G̃ is radial (G̃ = G̃(||s||)), the smoothness functional is rotationally invariant, that is

Φ[f(x)] = Φ[f(Rx)]  for all R ∈ O(d)

where O(d) is the group of rotations in d dimensions. The regularization solution becomes the so-called Radial Basis Functions technique:

f(x) = Σ_{i=1}^N c_i G(||x − x_i||) + p(x)
6 Radial Basis Functions. Define r = ||x||:

G(x) = e^{−r²}           Gaussian
G(x) = √(r² + c²)        multiquadric
G(x) = 1/√(c² + r²)      inverse multiquadric
G(x) = r^{2n+1}          multivariate splines
G(x) = r^{2n} ln r       multivariate splines
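As a concrete illustration, the regularized solution with a Gaussian basis amounts to solving the linear system (G + λI)c = y for the coefficients. This is a minimal sketch, not the slides' own code; the bandwidth `beta` and regularization `lam` below are arbitrary illustrative choices.

```python
import numpy as np

def rbf_fit(X, y, lam=1e-6, beta=50.0):
    """Solve (G + lam*I) c = y for the RBF coefficients c."""
    # pairwise squared distances between the data points
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    G = np.exp(-beta * d2)                       # Gaussian basis matrix
    return np.linalg.solve(G + lam * np.eye(len(X)), y)

def rbf_predict(X_train, c, X_new, beta=50.0):
    d2 = np.sum((X_new[:, None, :] - X_train[None, :, :]) ** 2, axis=-1)
    return np.exp(-beta * d2) @ c

# tiny 1-d example: fit a smooth function from 20 samples
X = np.linspace(0, 1, 20).reshape(-1, 1)
y = np.sin(2 * np.pi * X[:, 0])
c = rbf_fit(X, y)
y_hat = rbf_predict(X, c, X)                     # small residual on training points
```

With λ small the solution nearly interpolates the data; increasing λ trades data fit for smoothness, as in the functional above.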
7 Some good properties of RBF:
- Very well motivated in the framework of regularization theory
- The solution is unique and equivalent to solving a linear system
- The amount of smoothness is tunable (with λ)
- Radial Basis Functions are universal approximators
- Huge amount of literature on this subject
- Interpretation in terms of "neural networks"; biologically plausible
- Simple interpretation in terms of a "smooth look-up table"
- Similar to other non-parametric techniques, such as nearest neighbor and kernel regression
8 Some not-so-good properties of RBF:
- Computationally expensive (O(N³), where N = number of data points)
- The linear system is often badly ill-conditioned
- The same amount of smoothness is imposed on different regions of the domain
9 This function has different smoothness properties in different regions of its domain.
10 Least Squares Regularization Networks. We look for an approximation to the regularization solution f(x) = Σ_{i=1}^N c_i G(x − x_i) of the form

f*(x) = Σ_{α=1}^n c_α G(x − t_α)

where n << N and the vectors t_α are called centers. (Broomhead and Lowe, 1988; Moody and Darken, 1989; Poggio and Girosi, 1989)
11 Least Squares Regularization Networks.

f*(x) = Σ_{α=1}^n c_α G(x − t_α)

Suppose the centers t_α have been fixed. How do we find the coefficients c_α? ⇒ Least Squares
12 Least Squares Regularization Networks. Define

E(c_1, …, c_n) = Σ_{i=1}^N (y_i − f*(x_i))²

The least squares criterion is min_c E(c_1, …, c_n). The problem is convex and quadratic in the c_α, and is solved by setting ∂E/∂c_α = 0.
13 Least Squares. Setting ∂E/∂c = 0 leads to

GᵀGc = Gᵀy

where we have defined (y)_i = y_i, (c)_α = c_α, (G)_{iα} = G(x_i − t_α). Therefore c = G⁺y, where G⁺ = (GᵀG)⁻¹Gᵀ is the pseudoinverse of G.
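The pseudoinverse solution with fewer centers than examples can be sketched as follows; the data, centers, and Gaussian width below are hypothetical choices for illustration, not the slides' own example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))        # N = 100 examples
y = np.cos(3 * X[:, 0])
T = np.linspace(-1, 1, 10).reshape(-1, 1)    # n = 10 fixed centers, n << N

# (G)_{i,alpha} = G(x_i - t_alpha) with an (assumed) Gaussian basis
d2 = np.sum((X[:, None, :] - T[None, :, :]) ** 2, axis=-1)
G = np.exp(-5.0 * d2)

c = np.linalg.pinv(G) @ y                    # c = G^+ y
mse = float(np.mean((G @ c - y) ** 2))       # small training error
```

Note that only an n×n system (via the normal equations inside `pinv`) has to be solved, instead of the N×N system of the full regularization solution.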
14 Least Squares Regularization Networks. Given the centers t_α we know how to find the c_α. How do we choose the t_α?
1. as a subset of the examples
2. by a clustering algorithm (k-means, for example)
3. by least squares ("moving centers")
15 Centers as a subset of the examples. A fair technique. The subset is a random subset, which should reflect the distribution of the data. Not many theoretical results available. Main problem: how many centers? Main answer: we don't know. Cross-validation techniques seem a reasonable choice.
16 Clustering in R^d. Clustering a set of data points {x_i}_{i=1}^N means finding a set of m representative points {m_k}_{k=1}^m. Let S_k be the Voronoi polytope with center m_k:

S_k = {x : ||x − m_k|| < ||x − m_j||, j ≠ k}

A set of representative points M = {m_k}_{k=1}^m is optimal if it solves

min_M Σ_{k=1}^m Σ_{x_i ∈ S_k} ||x_i − m_k||²
17 K-means algorithm. The k-means algorithm (MacQueen, 1967) is an iterative technique for finding optimal representative points (cluster centers):

m_k^{(t+1)} = (1 / #S_k^{(t)}) Σ_{x_i ∈ S_k^{(t)}} x_i

This algorithm is guaranteed to find a local minimum. Many variations of this algorithm have been proposed.
18-20 K-means algorithm. [Figures: three snapshots of the evolution of the clusters S_1, S_2, S_3.]
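The update rule of slide 17 can be sketched in a few lines: assign each point to its nearest representative (its Voronoi cell) and recompute each center as the mean of its cell. The two-cluster data below is an assumed toy example.

```python
import numpy as np

def kmeans(X, m, n_iter=20):
    """Batch k-means: m holds the current cluster centers, one per row."""
    for _ in range(n_iter):
        # squared distance of every point to every center
        d2 = np.sum((X[:, None, :] - m[None, :, :]) ** 2, axis=-1)
        labels = np.argmin(d2, axis=1)           # Voronoi cell membership
        for k in range(len(m)):
            members = X[labels == k]
            if len(members) > 0:                 # keep old center if cell is empty
                m[k] = members.mean(axis=0)      # m_k^(t+1) = mean of S_k^(t)
    return m, labels

rng = np.random.default_rng(1)
# two well-separated clusters in the plane
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(3, 0.2, (50, 2))])
m, labels = kmeans(X, X[[0, -1]].copy())         # init from two data points
```

As the slide warns, this only finds a local minimum; the result depends on the initial centers.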
21 Finding the centers by clustering. Very common. However, it makes sense only if the input data points are clustered. No theoretical results. It is not clear that it is a good idea, especially for pattern classification problems.
22 Moving centers. Define

E(c_1, …, c_n, t_1, …, t_n) = Σ_{i=1}^N (y_i − f*(x_i))²

The least squares criterion is min_{c,t} E(c_1, …, c_n, t_1, …, t_n). The problem is no longer convex and quadratic: expect multiple local minima.
23 Moving centers.
:-) Very flexible, in principle very powerful
:-) Some theoretical understanding
:-( Very expensive computationally due to the local minima problem
:-( Centers sometimes move in "weird" ways
24 Some new approximation schemes. Radial Basis Functions with moving centers is a particular case of a function approximation technique of the form

f(x) = Σ_{α=1}^n c_α H(x; p_α)

where {p_α}_{α=1}^n is a set of parameters which can be estimated by least squares techniques. Radial Basis Functions corresponds to the choice H(x; p_α) = G(||x − p_α||).
25 Neural Networks. Neural networks correspond to a different choice. We set p_α ≡ (w_α, θ_α) and choose

H(x; p_α) = σ(x · w_α + θ_α)

where σ is a sigmoidal function. The resulting approximation scheme is

f(x) = Σ_{α=1}^n c_α σ(x · w_α + θ_α)

and it is called a Multilayer Perceptron with one layer of hidden units.
26 Examples of sigmoidal functions:

σ(x) = 1 / (1 + e^{−βx})
σ(x) = tanh(βx)

Notice that when β is very large the sigmoid approaches a step function.
27 One hidden layer Perceptron
28 Multilayer Perceptrons.
:-) Approximation schemes of this type have been successful in a number of cases
:-( Interpretation of the approximation technique is not immediate
:-) Multilayer Perceptrons are universal approximators
:-) Some theoretical results available
:-( Computationally expensive, local minima
:-| Motivation of this technique?
29 Multilayer perceptrons are particular cases of the more general ridge function approximation techniques

f(x) = Σ_{α=1}^n h_α(x · w_α)

in which we have chosen h_α(y) = c_α h(y + θ_α) for some fixed function h (the sigmoid).
30 Projection Pursuit Regression (Friedman and Stuetzle, 1981; Huber, 1986). Look for an approximating function of the form

f(x) = Σ_{α=1}^n h_α(x · w_α)

where the w_α are unit vectors and the h_α are functions to be fitted to the data (often cubic splines, trigonometric polynomials or Radial Basis Functions). The underlying idea is that all the information lies in a few one-dimensional projections of the data. The unit vectors w_α are the "interesting projections" that have to be pursued.
31 PPR algorithm.
1. Assume that the first ν − 1 terms of the expansion have been determined, and define the residuals

r_i^{(ν−1)} = y_i − Σ_{α=1}^{ν−1} h_α(x_i · w_α)

2. Find the ν-th vector w_ν and the ν-th function h_ν that solve

min_{w,h} Σ_{i=1}^N (r_i^{(ν−1)} − h(x_i · w))²

3. Go back to the first step and iterate.
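One PPR step can be sketched as below. This is a rough illustration, not Friedman and Stuetzle's actual procedure: the direction search is a naive random sampling of unit vectors, and a low-degree polynomial stands in for the smoother h.

```python
import numpy as np

def ppr_step(X, r, n_dirs=200, deg=3, seed=2):
    """Find a unit direction w and polynomial h minimizing sum (r - h(x.w))^2."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_dirs):
        w = rng.normal(size=X.shape[1])
        w /= np.linalg.norm(w)                  # candidate unit projection
        z = X @ w
        coef = np.polyfit(z, r, deg)            # fit h to residuals along z
        err = float(np.mean((np.polyval(coef, z) - r) ** 2))
        if best is None or err < best[0]:
            best = (err, w, coef)
    return best

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = (X @ np.array([1.0, 0, 0, 0, 0])) ** 2      # depends on a single projection
err, w, coef = ppr_step(X, y)                   # first step: residuals are y itself
```

Subsequent steps would subtract h(x · w) from the residuals and repeat, exactly as in the loop above the code.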
32 Motivation of PPR. Cluster analysis of a high-dimensional set of data points D: Diaconis and Freedman (1984) show that for most high-dimensional clouds, most one-dimensional projections are approximately normal. In cluster analysis normal means "uninteresting". Therefore there are few interesting projections, which can be found by maximizing a "projection index" that measures deviation from normality.
33 Principal Component Analysis
34 Let {x_α}_{α=1}^N be a set of points in R^d, with d possibly very large. Every point is specified by d numbers. Is there a more compact representation? It depends on the distribution of the data points in R^d...
35-38 [Figures: "There are random points ... and random points" — two example point clouds in the plane.]
39 Nr.  Feature
1     pupil to nose vertical distance
2     pupil to mouth vertical distance
3     pupil to chin vertical distance
4     nose width
5     mouth width
6     face width halfway between nose tip and eyes
7     face width at nose position
8-13  chin radii
14    mouth height
15    upper lip thickness
16    lower lip thickness
17    pupil to eyebrow separation
18    eyebrow thickness
40 Features 8 to 14
41 Principal Component Analysis. The data might lie (at least approximately) in a k-dimensional linear subspace of R^d, with k << d:

x ≈ Σ_{i=1}^k (x · d_i) d_i

where D = {d_i}_{i=1}^k is an orthonormal set of vectors: d_i · d_j = δ_{ij}. Principal Component Analysis lets us find the best orthonormal set of vectors {d_i}_{i=1}^k, which are called the first k principal components.
42 The orthonormal basis D = {d_i}_{i=1}^k is found by solving the following least squares problem:

min_D Σ_{α=1}^N ||x_α − Σ_{i=1}^k (x_α · d_i) d_i||² = max_D Σ_{α=1}^N Σ_{i=1}^k (x_α · d_i)² = N max_D Σ_{i=1}^k d_i · C d_i

where C is the correlation matrix

C_{ij} = (1/N) Σ_{α=1}^N x_α^i x_α^j
43 Let us consider the case k = 1, and write d_1 = d. The problem is now to solve

max_{||d||=1} d · C d

where C is a symmetric matrix (the correlation matrix). Setting C = RᵀΛR and d′ = Rd, with R a rotation matrix and Λ diagonal, we now have to solve

max_{||d′||=1} d′ · Λ d′
44 Notice that, since ||d′|| = 1,

d′ · Λ d′ ≤ λ_max

If we choose the vector d′_max = (0, 0, …, 1, …, 0), with the 1 in the position of the maximum eigenvalue, then Λ d′_max = λ_max d′_max, so d′_max · Λ d′_max = λ_max and d′_max maximizes the quadratic form d′ · Λ d′.
45 The first principal component, that is, the vector d_max that maximizes the quadratic form d · C d, is therefore d_max = Rᵀ d′_max, and it is easily seen to satisfy the eigenvector equation

C d_max = λ_max d_max

The first principal component is therefore the eigenvector of the correlation matrix corresponding to the maximum eigenvalue.
46 Case k > 1. It can be shown that the first k principal components are the eigenvectors of the correlation matrix C corresponding to the k largest eigenvalues.
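The whole derivation can be checked numerically: the best k-dimensional basis is the top-k eigenvectors of C. The synthetic data below (points near a 2-d subspace of R^5) is an assumed example.

```python
import numpy as np

rng = np.random.default_rng(4)
# 300 points lying close to a 2-d subspace of R^5, plus small noise
B = rng.normal(size=(2, 5))
X = rng.normal(size=(300, 2)) @ B + 0.01 * rng.normal(size=(300, 5))
X -= X.mean(axis=0)                    # zero-mean data, as assumed on the slides

C = X.T @ X / len(X)                   # correlation matrix C = (1/N) sum x x^T
evals, evecs = np.linalg.eigh(C)       # eigh returns eigenvalues in ascending order
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

k = 2
D = evecs[:, :k]                       # first k principal components
X_hat = X @ D @ D.T                    # projection onto their span
resid = float(np.mean((X - X_hat) ** 2))
```

Per the error formula of the next slide, the mean squared residual per entry equals (1/d) Σ_{i>k} λ_i, which the test below verifies.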
47 Try this at home: the approximation error of the first k principal components {d_i}_{i=1}^k is given by

E(k) = Σ_{α=1}^N ||x_α − Σ_{i=1}^k (x_α · d_i) d_i||² = N Σ_{i=k+1}^d λ_i

where the λ_i are the eigenvalues of the correlation matrix C. Hint: when k = d the error is zero, because the correlation matrix C is symmetric and its eigenvectors form a complete basis.
48 An example in 20 dimensions: x ∈ R^20, x_10 = x_8 + x_12, 500 points.
49-51 [Figures: surface plots of the correlation matrix and a plot of its eigenvalues.]
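Slide 48's experiment is easy to reproduce in outline (the Gaussian sampling below is an assumption; the slide only specifies the dimension, the constraint, and the number of points):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 20))        # 500 points in R^20
X[:, 9] = X[:, 7] + X[:, 11]          # impose x_10 = x_8 + x_12 (1-based indices)

X -= X.mean(axis=0)
C = X.T @ X / len(X)
evals = np.sort(np.linalg.eigvalsh(C))
```

Because the constraint confines the data to a 19-dimensional subspace, exactly one eigenvalue of the correlation matrix is (numerically) zero, which is what the eigenvalue plot on the slide shows.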
52 Another point of view on PCA. Let {x_α}_{α=1}^N be a set of points in R^d. Problem: find the unit vector d for which the projection of the data on d has maximum variance. Assuming that the data points have zero mean, this means:

max_{||d||=1} (1/N) Σ_{α=1}^N (x_α · d)²
53 This is formally the same as finding the first principal component. The second principal component is the projection of maximum variance in a subspace orthogonal to the first principal component... The k-th principal component is the projection of maximum variance in a subspace which is orthogonal to the subspace spanned by the first k − 1 principal components.
54 Extensions of Radial Basis Functions.
- Different variables can have different scales: f(x, y) = y² sin(100x)
- Different variables could have different units of measure: f = f(x, ẋ, ẍ)
- Not all the variables are independent or relevant: f(x, y, z, t) = g(x, y, z(x, y))
- Only some linear combinations of the variables are relevant: f(x, y, z) = sin(x + y + z)
55 Extensions of regularization theory. A priori knowledge: the relevant variables are linear combinations of the original ones, z = Wx for some (possibly rectangular) matrix W,

f(x) = g(Wx) = g(z)

and the function g is smooth. The regularization functional is now

H[g] = Σ_{i=1}^N (y_i − g(z_i))² + λΦ[g]

where z_i = Wx_i.
56 Extensions of regularization theory (continued). The solution is

g(z) = Σ_{i=1}^N c_i G(z − z_i)

Therefore the solution for f is

f(x) = g(Wx) = Σ_{i=1}^N c_i G(Wx − Wx_i)
57 If the matrix W were known, the coefficients could be computed as in the radial case:

(G + λI)c = y

where (y)_i = y_i, (c)_i = c_i, (G)_{ij} = G(Wx_i − Wx_j). The same criticisms of the Regularization Networks technique apply, leading to Generalized Regularization Networks:

f*(x) = Σ_{α=1}^n c_α G(Wx − Wt_α)
58 Since W is not known, it can be found by least squares. Define

E(c_1, …, c_n, W) = Σ_{i=1}^N (y_i − f*(x_i))²

Then we solve

min_{c,W} E(c_1, …, c_n, W)

The problem is no longer convex and quadratic: expect multiple local minima.
59 From RBF to HyperBF. When the basis function is radial, the Generalized Regularization Networks scheme becomes

f(x) = Σ_{α=1}^n c_α G(||x − t_α||_W)

where ||x||²_W = xᵀWᵀWx is a weighted norm; that is, a non-radial basis function technique (HyperBF).
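A single HyperBF unit with the weighted norm can be sketched as follows (the diagonal W below is an assumed illustration: it nearly ignores the second coordinate, so moving along that axis barely changes the unit's output):

```python
import numpy as np

def hyperbf(x, t, W, beta=1.0):
    """Gaussian basis evaluated on the weighted norm ||x - t||_W."""
    z = W @ (x - t)
    return float(np.exp(-beta * (z @ z)))      # G(||x - t||_W^2) = exp(-beta z.z)

W = np.diag([1.0, 0.1])                        # second coordinate heavily downweighted
t = np.zeros(2)
a = hyperbf(np.array([0.0, 5.0]), t, W)        # far from t only along the ignored axis
b = hyperbf(np.array([1.0, 0.0]), t, W)        # nearer to t, but along the relevant axis
```

Here a > b even though the first point is Euclidean-farther from the center: the metric W decides which directions matter, which is exactly what is learned in HyperBF.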
60 Least Squares:
1. min_c E(c_1, …, c_n)
2. min_{c,t} E(c_1, …, c_n, t_1, …, t_n)
3. min_{c,W} E(c_1, …, c_n, W)
4. min_{c,t,W} E(c_1, …, c_n, t_1, …, t_n, W)
61 A non-radial Gaussian function
62 A non-radial multiquadric function
63 Additive models. An additive model has the form

f(x) = Σ_{μ=1}^d f_μ(x^μ)

where

f_μ(x^μ) = Σ_{i=1}^N c_i^μ G(x^μ − x_i^μ)

In other words,

f(x) = Σ_{μ=1}^d Σ_{i=1}^N c_i^μ G(x^μ − x_i^μ)
64 Extensions of Additive Models. If we have fewer centers than examples we obtain

f(x) = Σ_{μ=1}^d f_μ(x^μ)

where

f_μ(x^μ) = Σ_{α=1}^n c_α^μ G(x^μ − t_α^μ)

In other words,

f(x) = Σ_{μ=1}^d Σ_{α=1}^n c_α^μ G(x^μ − t_α^μ)
65 Extensions of Additive Models. If we now allow an arbitrary linear transformation of the inputs, x → Wx, where W is a d′ × d matrix, we obtain

f(x) = Σ_{μ=1}^{d′} Σ_{α=1}^n c_α^μ G(x · w_μ − t_α^μ)

where w_μ is the μ-th row of the matrix W.
66 Extensions of Additive Models. The expression

f(x) = Σ_{μ=1}^{d′} Σ_{α=1}^n c_α^μ G(x · w_μ − t_α^μ)

can be written as

f(x) = Σ_{μ=1}^{d′} h_μ(x · w_μ)

where

h_μ(y) = Σ_{α=1}^n c_α^μ G(y − t_α^μ)

This form of approximation is called ridge approximation.
67 From the extension of additive models we can therefore justify an approximation technique of the form

f(x) = Σ_{μ=1}^{d′} Σ_{α=1}^n c_α^μ G(x · w_μ − t_α^μ)

Particular case: n = 1 (one center). Then we derive the following technique:

f(x) = Σ_{μ=1}^{d′} c_μ G(x · w_μ − t_μ)

which is a Multilayer Perceptron with a Radial Basis Function G instead of the sigmoid function. Notice that the sigmoid function is not a Radial Basis Function.
68 Regularization Networks. Summary:
- Regularization Networks (RN): radial stabilizer → RBF; additive stabilizer → Additive Splines; product stabilizer → Tensor Product Splines
- Generalized Regularization Networks (GRN), with movable metric and movable centers: HyperBF; Ridge Approximation
More informationKernel Methods. Machine Learning A W VO
Kernel Methods Machine Learning A 708.063 07W VO Outline 1. Dual representation 2. The kernel concept 3. Properties of kernels 4. Examples of kernel machines Kernel PCA Support vector regression (Relevance
More informationWidths. Center Fluctuations. Centers. Centers. Widths
Radial Basis Functions: a Bayesian treatment David Barber Bernhard Schottky Neural Computing Research Group Department of Applied Mathematics and Computer Science Aston University, Birmingham B4 7ET, U.K.
More informationMATH 590: Meshfree Methods
MATH 590: Meshfree Methods Chapter 34: Improving the Condition Number of the Interpolation Matrix Greg Fasshauer Department of Applied Mathematics Illinois Institute of Technology Fall 2010 fasshauer@iit.edu
More informationMath Camp Notes: Everything Else
Math Camp Notes: Everything Else Systems of Dierential Equations Consider the general two-equation system of dierential equations: Steady States ẋ = f(x, y ẏ = g(x, y Just as before, we can nd the steady
More informationAPPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.
APPENDIX A Background Mathematics A. Linear Algebra A.. Vector algebra Let x denote the n-dimensional column vector with components 0 x x 2 B C @. A x n Definition 6 (scalar product). The scalar product
More informationX/94 $ IEEE 1894
I Implementing Radial Basis Functions Using Bump-Resistor Networks John G. Harris University of Florida EE Dept., 436 CSE Bldg 42 Gainesville, FL 3261 1 harris@j upit er.ee.ufl.edu Abstract- Radial Basis
More information4.1 Eigenvalues, Eigenvectors, and The Characteristic Polynomial
Linear Algebra (part 4): Eigenvalues, Diagonalization, and the Jordan Form (by Evan Dummit, 27, v ) Contents 4 Eigenvalues, Diagonalization, and the Jordan Canonical Form 4 Eigenvalues, Eigenvectors, and
More informationPATTERN CLASSIFICATION
PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS
More informationOn the Noise Model of Support Vector Machine Regression. Massimiliano Pontil, Sayan Mukherjee, Federico Girosi
MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 1651 October 1998
More informationUnsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent
Unsupervised Machine Learning and Data Mining DS 5230 / DS 4420 - Fall 2018 Lecture 7 Jan-Willem van de Meent DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Dimensionality Reduction Goal:
More informationMATH 829: Introduction to Data Mining and Analysis Graphical Models I
MATH 829: Introduction to Data Mining and Analysis Graphical Models I Dominique Guillot Departments of Mathematical Sciences University of Delaware May 2, 2016 1/12 Independence and conditional independence:
More informationComputation. For QDA we need to calculate: Lets first consider the case that
Computation For QDA we need to calculate: δ (x) = 1 2 log( Σ ) 1 2 (x µ ) Σ 1 (x µ ) + log(π ) Lets first consider the case that Σ = I,. This is the case where each distribution is spherical, around the
More informationCalculating Forward Rate Curves Based on Smoothly Weighted Cubic Splines
Calculating Forward Rate Curves Based on Smoothly Weighted Cubic Splines February 8, 2017 Adnan Berberovic Albin Göransson Jesper Otterholm Måns Skytt Peter Szederjesi Contents 1 Background 1 2 Operational
More informationAbstract. Jacobi curves are far going generalizations of the spaces of \Jacobi
Principal Invariants of Jacobi Curves Andrei Agrachev 1 and Igor Zelenko 2 1 S.I.S.S.A., Via Beirut 2-4, 34013 Trieste, Italy and Steklov Mathematical Institute, ul. Gubkina 8, 117966 Moscow, Russia; email:
More informationPattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods
Pattern Recognition and Machine Learning Chapter 6: Kernel Methods Vasil Khalidov Alex Kläser December 13, 2007 Training Data: Keep or Discard? Parametric methods (linear/nonlinear) so far: learn parameter
More informationData Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis
Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inrialpes.fr http://perception.inrialpes.fr/ Outline of Lecture
More informationwhich arises when we compute the orthogonal projection of a vector y in a subspace with an orthogonal basis. Hence assume that P y = A ij = x j, x i
MODULE 6 Topics: Gram-Schmidt orthogonalization process We begin by observing that if the vectors {x j } N are mutually orthogonal in an inner product space V then they are necessarily linearly independent.
More informationReproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto
Reproducing Kernel Hilbert Spaces 9.520 Class 03, 15 February 2006 Andrea Caponnetto About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationMotivating the Covariance Matrix
Motivating the Covariance Matrix Raúl Rojas Computer Science Department Freie Universität Berlin January 2009 Abstract This note reviews some interesting properties of the covariance matrix and its role
More informationTheorems. Least squares regression
Theorems In this assignment we are trying to classify AML and ALL samples by use of penalized logistic regression. Before we indulge on the adventure of classification we should first explain the most
More informationCPSC 340: Machine Learning and Data Mining. Sparse Matrix Factorization Fall 2018
CPSC 340: Machine Learning and Data Mining Sparse Matrix Factorization Fall 2018 Last Time: PCA with Orthogonal/Sequential Basis When k = 1, PCA has a scaling problem. When k > 1, have scaling, rotation,
More informationMathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector
On Minimax Filtering over Ellipsoids Eduard N. Belitser and Boris Y. Levit Mathematical Institute, University of Utrecht Budapestlaan 6, 3584 CD Utrecht, The Netherlands The problem of estimating the mean
More informationLECTURE NOTE #11 PROF. ALAN YUILLE
LECTURE NOTE #11 PROF. ALAN YUILLE 1. NonLinear Dimension Reduction Spectral Methods. The basic idea is to assume that the data lies on a manifold/surface in D-dimensional space, see figure (1) Perform
More informationLeast Squares Approximation
Chapter 6 Least Squares Approximation As we saw in Chapter 5 we can interpret radial basis function interpolation as a constrained optimization problem. We now take this point of view again, but start
More informationSpectral Regularization
Spectral Regularization Lorenzo Rosasco 9.520 Class 07 February 27, 2008 About this class Goal To discuss how a class of regularization methods originally designed for solving ill-posed inverse problems,
More informationPCA, Kernel PCA, ICA
PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per
More informationSTATISTICAL LEARNING SYSTEMS
STATISTICAL LEARNING SYSTEMS LECTURE 8: UNSUPERVISED LEARNING: FINDING STRUCTURE IN DATA Institute of Computer Science, Polish Academy of Sciences Ph. D. Program 2013/2014 Principal Component Analysis
More informationConnection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis
Connection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis Alvina Goh Vision Reading Group 13 October 2005 Connection of Local Linear Embedding, ISOMAP, and Kernel Principal
More informationCIS 520: Machine Learning Oct 09, Kernel Methods
CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed
More informationData Mining Techniques
Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 4 Jan-Willem van de Meent (credit: Yijun Zhao, Arthur Gretton Rasmussen & Williams, Percy Liang) Kernel Regression Basis function regression
More informationBANA 7046 Data Mining I Lecture 6. Other Data Mining Algorithms 1
BANA 7046 Data Mining I Lecture 6. Other Data Mining Algorithms 1 Shaobo Li University of Cincinnati 1 Partially based on Hastie, et al. (2009) ESL, and James, et al. (2013) ISLR Data Mining I Lecture
More informationAnnouncements. Proposals graded
Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:
More informationNeural Network Training
Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification
More informationStatistical Methods for Data Mining
Statistical Methods for Data Mining Kuangnan Fang Xiamen University Email: xmufkn@xmu.edu.cn Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find
More informationIdentication and Control of Nonlinear Systems Using. Neural Network Models: Design and Stability Analysis. Marios M. Polycarpou and Petros A.
Identication and Control of Nonlinear Systems Using Neural Network Models: Design and Stability Analysis by Marios M. Polycarpou and Petros A. Ioannou Report 91-09-01 September 1991 Identication and Control
More informationGroup Theory. 1. Show that Φ maps a conjugacy class of G into a conjugacy class of G.
Group Theory Jan 2012 #6 Prove that if G is a nonabelian group, then G/Z(G) is not cyclic. Aug 2011 #9 (Jan 2010 #5) Prove that any group of order p 2 is an abelian group. Jan 2012 #7 G is nonabelian nite
More informationMLCC 2015 Dimensionality Reduction and PCA
MLCC 2015 Dimensionality Reduction and PCA Lorenzo Rosasco UNIGE-MIT-IIT June 25, 2015 Outline PCA & Reconstruction PCA and Maximum Variance PCA and Associated Eigenproblem Beyond the First Principal Component
More information
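For a strictly positive definite kernel such as the Gaussian, the regularized Radial Basis Functions solution above reduces to solving the linear system (G + λI)c = y, and the polynomial term p(x) can be dropped. A minimal NumPy sketch of this computation follows; the function names, the toy data, and the values of λ and the kernel width σ are illustrative, not part of the original notes.

```python
import numpy as np

def gaussian_rbf_fit(X, y, lam=1e-8, sigma=1.0):
    """Solve (G + lam*I) c = y with G_ij = exp(-||x_i - x_j||^2 / sigma^2).

    The Gaussian is positive definite, so no polynomial term p(x) is needed.
    """
    # Pairwise squared distances between the data points (the RBF centers)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    G = np.exp(-d2 / sigma**2)
    return np.linalg.solve(G + lam * np.eye(len(X)), y)

def gaussian_rbf_predict(X_train, c, X_new, sigma=1.0):
    """Evaluate f(x) = sum_i c_i G(||x - x_i||) at the query points."""
    d2 = np.sum((X_new[:, None, :] - X_train[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / sigma**2) @ c

# Toy 1-D example: noiseless samples of sin(x) on [0, 3]
X = np.linspace(0, 3, 8).reshape(-1, 1)
y = np.sin(X).ravel()
c = gaussian_rbf_fit(X, y, lam=1e-10, sigma=1.0)
y_hat = gaussian_rbf_predict(X, c, X, sigma=1.0)
print(np.max(np.abs(y_hat - y)))  # small residual: near-interpolation for tiny lam
```

As λ → 0 the fit approaches exact interpolation of the data; larger λ trades data fidelity for the smoothness enforced by the stabilizer.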