Spectral Filtering for Multi-Output Learning


1 Spectral Filtering for Multi-Output Learning. Lorenzo Rosasco, Center for Biological and Computational Learning, MIT, and Università di Genova, Italy.

2 Plan: learning with kernels; multi-output kernels and regularization; spectral filtering; perspectives.

3 Scalar Case. Function estimation from samples: $f : \mathbb{R}^d \to \mathbb{R}$, given $(x_i, y_i)_{i=1}^n$. Kernel models: $f = \sum_j K(x_j, \cdot)\, c_j$.

4 Kernels and Regularization. RKHS definition: a Hilbert space of functions $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$ such that $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, $k(x, \cdot) \in \mathcal{H}$ and $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}}$. Tikhonov regularization: $\min_{f \in \mathcal{H}} \big\{ \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2 \big\}$.
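A minimal sketch of scalar Tikhonov regularization with a Gaussian kernel (not from the slides; kernel width and $\lambda$ are illustrative). It uses the standard closed-form solution $c = (\mathbf{K} + \lambda n I)^{-1} y$, the same formula that reappears on a later slide.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """Gaussian kernel matrix k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def tikhonov_fit(X, y, lam, sigma=1.0):
    """Solve min_f 1/n sum_i (y_i - f(x_i))^2 + lam ||f||_H^2; returns coefficients c."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def tikhonov_predict(X_train, c, X_test, sigma=1.0):
    """Evaluate f(x) = sum_j k(x_j, x) c_j at the test points."""
    return gaussian_kernel(X_test, X_train, sigma) @ c

# toy 1-D example
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
c = tikhonov_fit(X, y, lam=1e-2)
print(tikhonov_predict(X, c, np.array([[0.5]])))
```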

5 Kernel Design. Feature map: $K(x, s) = \langle \Phi(x), \Phi(s) \rangle$. Regularizer: $J(f) = \|f\|_{\mathcal{H}}^2$.

6 Multiple Outputs. Vector-valued functions $f : \mathbb{R}^d \to \mathbb{R}^T$; samples $(x_i^t, y_i^t)_{i=1}^{n_t}$ for $t = 1, \dots, T$. Kernel models: $f = \sum_j K(x_j, \cdot)\, c_j$ with $c_j \in \mathbb{R}^T$.

7 RKHS. Vector-valued RKHS definition: a Hilbert space of functions $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$ such that $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^{T \times T}$ and, for $c \in \mathbb{R}^T$, $K(x, \cdot)c \in \mathcal{H}$ and $\langle f(x), c \rangle = \langle f, K(\cdot, x)c \rangle_{\mathcal{H}}$. Tikhonov regularization: $\min_{f = (f^1, \dots, f^T) \in \mathcal{H}} \big\{ \sum_{j=1}^T \frac{1}{n_j} \sum_{i=1}^{n_j} (y_i^j - f^j(x_i^j))^2 + \lambda \|f\|_{\mathcal{H}}^2 \big\}$. The representer theorem shows that the regularized solution can be written as a finite kernel expansion.

8 Which Kernels? Component-wise definition: $K : (\mathbb{R}^d \times \{1, \dots, T\}) \times (\mathbb{R}^d \times \{1, \dots, T\}) \to \mathbb{R}$, i.e. $K((x, t), (x', t'))$. A general class of kernels: $K(x, x') = \sum_r k_r(x, x') A_r$.

9 Kernels and Regularizers. Consider $K(x, x') = k(x, x') A$. Then $\|f\|_{\mathcal{H}}^2 = \sum_{j,i} A_{j,i} \langle f^j, f^i \rangle_k$, with $f = (f^1, \dots, f^T)$.
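For the separable case $K(x, x') = k(x, x') A$, the full $nT \times nT$ kernel matrix is a Kronecker product of the scalar kernel matrix with $A$. A sketch (the Gaussian base kernel and the particular $A$ are placeholder choices, not from the slides):

```python
import numpy as np

def scalar_kernel(X, Z, sigma=1.0):
    """Gaussian base kernel k(x, z)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def separable_kernel_matrix(X, A, sigma=1.0):
    """Big kernel matrix for K(x, x') = k(x, x') A, shape (nT) x (nT)."""
    K = scalar_kernel(X, X, sigma)      # n x n
    return np.kron(K, A)                # (nT) x (nT)

# example: T = 3 outputs coupled by a chosen output matrix A
n, d, T = 20, 2, 3
X = np.random.default_rng(0).standard_normal((n, d))
A = 0.5 * np.eye(T) + 0.5 * np.ones((T, T)) / T
G = separable_kernel_matrix(X, A)
print(G.shape)  # (60, 60)
```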

10 Example: Mixed Effect. $\Gamma_\omega(x, x') = K(x, x')\,(\omega \mathbf{1} + (1 - \omega) I)$, where $\mathbf{1}$ is the $T \times T$ matrix whose entries are all equal to 1. The induced regularizer is $J(f) = A_\omega \big( B_\omega \sum_{l=1}^T \|f^l\|_K^2 + \omega T \sum_{l=1}^T \big\| f^l - \frac{1}{T} \sum_{q=1}^T f^q \big\|_K^2 \big)$, with $A_\omega$, $B_\omega$ constants depending on $\omega$. It is composed of a term penalizing each component and a term penalizing the deviation of each component from their mean.

11 Example: Clustering Outputs. $M$ specifies the clusters: $G_{lq} = \epsilon_1 \delta_{lq} + (\epsilon_2 - \epsilon_1) M_{lq}$, and $K(x, x') = k(x, x')\, G$. The induced regularizer is $J(f) = \epsilon_1 \sum_{c=1}^r \sum_{l \in I(c)} \|f^l - \bar{f}_c\|_K^2 + \epsilon_2 \sum_{c=1}^r m_c \|\bar{f}_c\|_K^2$, where $I(c)$ is the index set of cluster $c$, $m_c$ its cardinality and $\bar{f}_c$ the mean of the components in cluster $c$.

12 Example: Graph. $M$ is an adjacency matrix among the tasks and $L = D - M$ the corresponding graph Laplacian; $K(x, x') = k(x, x')\, L$. The regularizer can be rewritten as $J(f) = \frac{1}{2} \sum_{l,q=1}^T \|f^l - f^q\|_K^2 M_{lq} + \sum_{l=1}^T \|f^l\|_K^2 M_{ll}$.
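The three output matrices of the last slides (mixed effect, cluster, graph Laplacian) can be written down directly; a small sketch, where $\epsilon_1$, $\epsilon_2$, $\omega$ are free parameters and the example clustering is made up for illustration:

```python
import numpy as np

def mixed_effect_matrix(T, omega):
    """omega * 1 + (1 - omega) * I, with 1 the all-ones T x T matrix."""
    return omega * np.ones((T, T)) + (1 - omega) * np.eye(T)

def cluster_matrix(M, eps1, eps2):
    """G_lq = eps1 * delta_lq + (eps2 - eps1) * M_lq; M_lq = 1 iff l, q in the same cluster."""
    T = M.shape[0]
    return eps1 * np.eye(T) + (eps2 - eps1) * M

def graph_laplacian(M):
    """L = D - M for a task adjacency matrix M."""
    return np.diag(M.sum(axis=1)) - M

# example: 4 tasks, two clusters {0, 1} and {2, 3}
M = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]])
print(mixed_effect_matrix(4, 0.3))
print(cluster_matrix(M, 1.0, 2.0))
print(graph_laplacian(M))
```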

13 Inference and Computations. Least squares and Tikhonov: $c = (\mathbf{K} + \lambda n I)^{-1} Y$. The kernel matrix $\mathbf{K}$ is $(Tn) \times (Tn)$; $c$ and $Y$ are vectors of length $Tn$. Computing the solution for $N$ different regularization parameters is expensive: $O(N (Tn)^3)$.
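A direct sketch of this slide (assuming the big kernel matrix is built as in the earlier Kronecker sketch): each regularization parameter costs one dense solve, so a grid of $N$ values pays roughly $N$ times $O((Tn)^3)$ unless one eigendecomposition is reused, as anticipated by the filtering view on the next slides.

```python
import numpy as np

def tikhonov_multioutput(Kbig, Y, lam, n):
    """c = (Kbig + lam * n * I)^{-1} Y, with Kbig the (nT x nT) kernel matrix
    and Y the vectorized outputs of length nT."""
    nT = Kbig.shape[0]
    return np.linalg.solve(Kbig + lam * n * np.eye(nT), Y)

def tikhonov_path_naive(Kbig, Y, lambdas, n):
    """Naive regularization path: one O((nT)^3) solve per lambda."""
    return [tikhonov_multioutput(Kbig, Y, lam, n) for lam in lambdas]

def tikhonov_path_eig(Kbig, Y, lambdas, n):
    """Reuse one eigendecomposition: each additional lambda costs only O((nT)^2)."""
    sig, U = np.linalg.eigh(Kbig)
    UtY = U.T @ Y
    return [U @ (UtY / (sig + lam * n)) for lam in lambdas]

# usage (Kbig and Y as above, hypothetical names):
# coeffs = tikhonov_path_eig(Kbig, Y, lambdas=[1e-3, 1e-2, 1e-1], n=n)
```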

14 Ill-posed Problems. Well-posedness in the sense of Hadamard: a solution exists, the solution is unique, and the solution depends continuously on the data. Problems that are not well-posed are termed ill-posed. Well-posedness is studied for the linear equation $Af = g$, with $A : \mathcal{H} \to \mathcal{K}$ an operator between Hilbert spaces, $f \in \mathcal{H}$ and $g \in \mathcal{K}$.

15 Regularization and Filtering. Tikhonov: $c = \sum_i \frac{1}{\sigma_i + \lambda n} \langle u_i, Y \rangle\, u_i$. Spectral filtering: $c = \sum_i G_\lambda(\sigma_i) \langle u_i, Y \rangle\, u_i$, where $(\sigma_i, u_i)$ are the eigenvalues and eigenvectors of the kernel matrix.
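The general recipe of this slide as code: eigendecompose the kernel matrix once and replace $1/(\sigma + \lambda n)$ with any filter $G_\lambda(\sigma)$. A minimal sketch (the specific filter passed in is up to the user):

```python
import numpy as np

def spectral_filter_solve(K, Y, filter_fn):
    """c = sum_i G_lambda(sigma_i) <u_i, Y> u_i for a symmetric PSD kernel matrix K."""
    sig, U = np.linalg.eigh(K)              # K = U diag(sig) U^T
    return U @ (filter_fn(sig) * (U.T @ Y))

def tikhonov_filter(lam, n):
    """Tikhonov as a special case: G_lambda(sigma) = 1 / (sigma + lam * n)."""
    return lambda sig: 1.0 / (sig + lam * n)

# usage (K and Y as in the previous sketches, hypothetical names):
# c = spectral_filter_solve(K, Y, tikhonov_filter(lam=1e-2, n=K.shape[0]))
```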

16 Regularization and Filtering. [Plot comparing $\frac{1}{\sigma_j} \langle Y, u_j \rangle$ with $G_\lambda(\sigma_j) \langle Y, u_j \rangle$ as a function of the index $j$: the filter acts as a low-pass filter, keeping the components associated with large eigenvalues and damping those associated with small eigenvalues.]

17 Classical Examples. Tikhonov regularization: $G_\lambda(\sigma) = \frac{1}{\sigma + \lambda}$.

18 Other Examples. Many other examples of filters (only some known in machine learning): TSVD (principal component regression), Landweber iteration ($L_2$ boosting), the $\nu$-method, iterated Tikhonov (Engl et al.; Rosasco et al. '05; Lo Gerfo et al. '08; Bauer et al. '05).
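For illustration, a few of the filters named above written as functions of the eigenvalues, compatible with the `spectral_filter_solve` sketch earlier. These follow the standard forms from the inverse-problems literature (e.g. Engl et al.), so the exact parameterizations should be checked against the references; the $\nu$-method is omitted since its coefficients are less compact.

```python
import numpy as np

def tsvd_filter(lam):
    """Truncated SVD / principal component regression: keep eigenvalues above lam."""
    return lambda sig: np.where(sig >= lam, 1.0 / np.maximum(sig, 1e-300), 0.0)

def landweber_filter(t, eta):
    """Landweber / L2 boosting after t steps: G_t(sigma) = (1 - (1 - eta*sigma)^t) / sigma."""
    return lambda sig: (1.0 - (1.0 - eta * sig) ** t) / np.maximum(sig, 1e-300)

def iterated_tikhonov_filter(lam, nu):
    """nu-times iterated Tikhonov: ((sigma + lam)^nu - lam^nu) / (sigma (sigma + lam)^nu)."""
    return lambda sig: ((sig + lam) ** nu - lam ** nu) / (
        np.maximum(sig, 1e-300) * (sig + lam) ** nu)
```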

19 Early Stopping. The filter corresponds to a truncated expansion of the inverse: $G_\lambda(\sigma) = \eta \sum_{j=1}^t (1 - \eta\sigma)^{j-1} \approx \frac{1}{\sigma}$, i.e. $A^{-1} \approx \eta \sum_{j=1}^t (I - \eta A)^{j-1}$. Implementation: set $\alpha^0 = 0$ and, for $i = 1, \dots, t$, $\alpha^i = \alpha^{i-1} + \eta (Y - \mathbf{K}\alpha^{i-1})$. Estimator: $f^t = \sum_{i=1}^n \alpha_i^t K(x_i, \cdot)$.
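The update on this slide as code: gradient descent on the unregularized empirical risk, stopped after $t$ steps. The step-size condition $\eta < 2/\|\mathbf{K}\|$ is a standard convergence requirement, not stated on the slide.

```python
import numpy as np

def early_stopping(K, Y, t, eta=None):
    """Landweber / early stopping: alpha^0 = 0, alpha^i = alpha^{i-1} + eta (Y - K alpha^{i-1})."""
    if eta is None:
        eta = 1.0 / np.linalg.norm(K, 2)   # safe step size based on the spectral norm
    alpha = np.zeros_like(Y, dtype=float)
    for _ in range(t):
        alpha = alpha + eta * (Y - K @ alpha)
    return alpha                            # f_t = sum_i alpha_i k(x_i, .)
```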

20-23 Early stopping at work. [Plots of the data and the estimator ($Y$ versus $X$) after an increasing number of iterations.]

24 Remarks. Empirical risk minimization with no constraints; the number of iterations $t$ plays the role of the regularization parameter. No need for an SVD: only matrix/vector multiplications, so the whole regularization path costs $O(N (Tn)^2)$.

25 Fast Solution for Tikhonov Regularization. For kernels of the form $K(x, x') = k(x, x') A$ we can diagonalize $A$ and rotate the data: Tikhonov regularization can then be solved at the price of a single task!
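A sketch of this trick, under the assumption that every output is observed at every input: diagonalize $A = V \mathrm{diag}(d) V^\top$, rotate the outputs by $V$, solve $T$ independent scalar Tikhonov problems that share one eigendecomposition of the scalar kernel matrix, and rotate back.

```python
import numpy as np

def fast_separable_tikhonov(K, Y, A, lam):
    """Tikhonov for K(x, x') = k(x, x') A; K is n x n, Y is n x T (all outputs observed)."""
    n, T = Y.shape
    dA, V = np.linalg.eigh(A)            # A = V diag(dA) V^T
    sig, U = np.linalg.eigh(K)           # scalar kernel matrix eigendecomposition
    Yt = U.T @ Y @ V                     # rotate inputs (eigenbasis of K) and outputs (eigenbasis of A)
    # elementwise: each (input, output) eigendirection is a scalar problem
    Ct = Yt / (sig[:, None] * dA[None, :] + lam * n)
    return U @ Ct @ V.T                  # coefficients C (n x T): f(x) = sum_i k(x_i, x) A c_i
```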

26 Vector Fields. $v_1(x, y) = 2\sin(3x)\sin(1.5y)$, $v_2(x, y) = 2\cos(3y)\cos(1.5x)$, plus convolution with a Gaussian.
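A possible way to generate such a smoothed synthetic field (not from the slides: the grid resolution and smoothing width are arbitrary choices, and scipy's Gaussian filter stands in for the convolution):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# grid over [0, 1]^2 (resolution and smoothing width are arbitrary choices)
xs = np.linspace(0, 1, 70)
X, Y = np.meshgrid(xs, xs)

V1 = 2 * np.sin(3 * X) * np.sin(1.5 * Y)
V2 = 2 * np.cos(3 * Y) * np.cos(1.5 * X)

# convolution with a Gaussian to smooth the field
V1s = gaussian_filter(V1, sigma=3)
V2s = gaussian_filter(V2, sigma=3)
```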

27 Useful Kernels. Divergence free: $\Gamma_{df}(x, x') = \frac{1}{\sigma^2} e^{-\frac{\|x - x'\|^2}{2\sigma^2}} A_{x,x'}$, with $A_{x,x'} = \frac{(x - x')(x - x')^T}{\sigma^2} + \big( (T - 1) - \frac{\|x - x'\|^2}{\sigma^2} \big) I$. Curl free: $\Gamma_{cf}(x, x') = \frac{1}{\sigma^2} e^{-\frac{\|x - x'\|^2}{2\sigma^2}} \big( I - \big(\frac{x - x'}{\sigma}\big)\big(\frac{x - x'}{\sigma}\big)^T \big)$.
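The two matrix-valued kernels as code, following the formulas above (with $T$ the dimension of the field). A sketch to illustrate the structure; the constants are reconstructed from a garbled source, so they are worth checking against the original reference before use.

```python
import numpy as np

def curl_free_kernel(x, xp, sigma):
    """Gamma_cf(x, x') = (1/sigma^2) exp(-||x - x'||^2 / (2 sigma^2)) (I - dd^T / sigma^2)."""
    d = x - xp
    g = np.exp(-(d @ d) / (2 * sigma ** 2)) / sigma ** 2
    return g * (np.eye(len(x)) - np.outer(d, d) / sigma ** 2)

def div_free_kernel(x, xp, sigma):
    """Gamma_df(x, x') = (1/sigma^2) exp(-||x - x'||^2 / (2 sigma^2)) A_{x,x'},
    with A_{x,x'} = dd^T / sigma^2 + ((T - 1) - ||d||^2 / sigma^2) I."""
    d = x - xp
    T = len(x)
    g = np.exp(-(d @ d) / (2 * sigma ** 2)) / sigma ** 2
    A = np.outer(d, d) / sigma ** 2 + ((T - 1) - (d @ d) / sigma ** 2) * np.eye(T)
    return g * A

print(div_free_kernel(np.array([0.1, 0.2]), np.zeros(2), sigma=0.5))
```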

28 Numerical Results. [Figure 1: (a) the artificial vector field without noise; (b) the field estimated from a noisy training set; (c-d) the divergence-free and curl-free parts estimated with the corresponding kernels.] A sample of points is drawn from the grid and the predictions on the whole grid are compared to the correct field; for each training-set size the experiment is repeated with different samplings of the training points. Errors are measured with an angular measure: if $v_o$ and $v_e$ are the original and estimated fields, each vector is mapped to $\tilde{v} = \frac{1}{\|(v_1, v_2, 1)\|}(v_1, v_2, 1)$ and the error is $\mathrm{err} = \arccos(\tilde{v}_e \cdot \tilde{v}_o)$. This measure, which interprets the vector field as a velocity field, handles large and small signals without the amplification inherent in a relative measure of vector differences. Both the noiseless case and the case of Gaussian noise proportional to the signal (standard deviation proportional to the local field magnitude) are considered. [Figure 2: test errors for the proposed vector-valued approach (VVR) and for learning each component of the field independently (INDIP); noiseless case and field with Gaussian noise of width equal to 2% of the field magnitude, as a function of the number of training examples.] The advantage of combining the divergence-free and curl-free kernels is apparent, especially for fewer training points.

29 Numerical Results. [Figure 3: computation time (in seconds) for the whole regularization path with the $\nu$-method and with Tikhonov regularization, as a function of the number of training examples.]

30 Some Theory: Random Operators. $T_{\mathbf{x}} f(x) = \frac{1}{n} \sum_{i=1}^n K(x, x_i) f(x_i)$, $\quad T f(x) = \int K(x, s) f(s)\, d\rho(s)$, and $P\big( \|T - T_{\mathbf{x}}\| \le \frac{C t}{\sqrt{n}} \big) \ge 1 - e^{-t^2}$. The above result implies convergence of eigenfunctions and eigenvalues.

31 Learning Rates. Theorem: if $\|T^{-r} f_\rho\| \le R$ with $r > 1/2$ and $\sigma_i \sim i^{-1/b}$, $b > 1$, then $P\big( \|f_n - f_\rho\|_\rho^2 \le C \tau\, n^{-\frac{2rb}{2rb+1}} \big) \ge 1 - e^{-\tau^2}$ for $\lambda = n^{-\frac{1}{2rb+1}}$. The above estimate is optimal in a minimax sense. The parameter choice can be done adaptively (Caponnetto et al.; for $b = 1$, Bauer et al.).

32 Comments One name, 3 problems? Learning the kernels?

33 Vector Fields and Multi-tasks. [Plots contrasting the multi-task view (separate plots of $Y$ versus $X$ for Task 1 and Task 2) with the vector-field view (Component 1 and Component 2 of a single field over the same $X$).]

34 Different Regimes? $n > d > T$ (classical); $d > n$ (high-dimensional inference); $T > n$ or $n > T$ (??). Curse of dimensionality vs. blessing of smoothness: smoothness$/d$ should be big.

35 Multiple Classes. Inputs belong to one of $T$ classes. See Rifkin and Klautau, "In Defense of One-Vs-All Classification", Journal of Machine Learning Research 5 (2004).

36 One Versus All. Coding: class 1 $\mapsto (1, -1, -1, \dots)$, class 2 $\mapsto (-1, 1, -1, \dots)$, ... Regression of the coding vectors: $\min_{f = (f^1, \dots, f^T)} \sum_{i=1}^n \|y_i - f(x_i)\|_T^2 + \lambda \sum_{j=1}^T \|f^j\|^2$. Classification rule: $c(x) = \arg\max_{j = 1, \dots, T} f^j(x)$.
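One-vs-all with regularized least squares on the $\pm 1$ coding vectors; a minimal sketch reusing a Gaussian kernel (the kernel choice and $\lambda$ are illustrative, not from the slides). One linear solve with $T$ right-hand sides covers all classes.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def ova_fit(X, labels, T, lam, sigma=1.0):
    """Regress the +/-1 coding vectors: one RLS problem per class with a shared kernel matrix."""
    n = X.shape[0]
    Y = -np.ones((n, T))
    Y[np.arange(n), labels] = 1.0                     # coding: class j -> j-th entry +1, rest -1
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), Y)  # one solve, T right-hand sides

def ova_predict(X_train, C, X_test, sigma=1.0):
    scores = gaussian_kernel(X_test, X_train, sigma) @ C
    return scores.argmax(axis=1)                      # c(x) = argmax_j f^j(x)
```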

37 Remarks. No correlation among classes is modeled: how can we estimate it? In simulations one observes an improvement in probability estimation but NOT in classification performance.

38 Regression vs Classification. The components of the regression function are proportional to the conditional probabilities of each class; the obtained estimator is Bayes consistent.

39 Learning the Kernel? Bayesian Approaches (consistency guarantees/stability/computability?) Regularization (what is the underlying Kernel? How are the outputs related?)
