Advances in kernel exponential families


1 Advances in kernel exponential families. Arthur Gretton, Gatsby Computational Neuroscience Unit, University College London. NIPS.

2 Outline
Motivating application: fast estimation of complex multivariate densities.
The infinite exponential family:
- Multivariate Gaussian → Gaussian process
- Finite mixture model → Dirichlet process mixture model
- Finite exponential family → ???
In this talk: guaranteed speed improvements by Nyström; conditional models.

3 Goal: learn high dimensional, complex densities
[Figure: scatter plots of selected dimensions of the Red Wine and Parkinsons datasets]
We want: efficient computation and representation; statistical guarantees.

4 The exponential family
The exponential family in $\mathbb{R}^d$:
$p(x;\theta) = \exp\big(\langle\, \underbrace{\theta}_{\text{natural parameter}},\ \underbrace{T(x)}_{\text{sufficient statistic}}\,\rangle - \underbrace{A(\theta)}_{\text{log normaliser}}\big)\ \underbrace{q_0(x)}_{\text{base measure}}$
Examples:
- Gaussian density: $T(x) = [x,\ x^2]$
- Gamma density: $T(x) = [\ln x,\ x]$
Can we extend this to infinite dimensions?
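A minimal sketch (my illustration, not from the slides), assuming Python/NumPy and a unit base measure $q_0(x) = 1$: the 1-D Gaussian written in this exponential-family form with $T(x) = [x, x^2]$. Function names are illustrative only.

```python
# Illustrative sketch: the 1-D Gaussian in exponential-family form,
# with sufficient statistic T(x) = [x, x^2] and base measure q0(x) = 1.
import numpy as np

def gaussian_natural_params(mu, sigma2):
    # natural parameters theta such that <theta, T(x)> - A(theta) gives N(mu, sigma2)
    return np.array([mu / sigma2, -0.5 / sigma2])

def log_p(x, theta):
    T = np.array([x, x ** 2])                                  # sufficient statistic
    # log normaliser A(theta) for this family, written in terms of theta
    A = -theta[0] ** 2 / (4 * theta[1]) + 0.5 * np.log(-np.pi / theta[1])
    return theta @ T - A

theta = gaussian_natural_params(mu=1.0, sigma2=2.0)
print(log_p(0.5, theta))   # equals log N(x=0.5; mean 1.0, variance 2.0)
```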

5–6 Infinitely many features using kernels
Kernels: dot products of features.
Feature map $\varphi(x) \in \mathcal{H}$, $\varphi(x) = [\ldots\ \varphi_i(x)\ \ldots] \in \ell_2$.
Exponentiated quadratic kernel: $k(x, x') = \exp\!\big(-\|x - x'\|^2 / (2\sigma^2)\big)$.
For positive definite $k$,
$k(x, x') = \langle \varphi(x), \varphi(x')\rangle_{\mathcal{H}} = \langle k(x, \cdot),\ k(x', \cdot)\rangle_{\mathcal{H}}$
Infinitely many features $\varphi(x)$, dot product in closed form!
Features: Gaussian Processes for Machine Learning, Rasmussen and Williams, Ch. 4.
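A minimal sketch of the closed-form kernel evaluation; the bandwidth parameter `sigma` and the function name are assumptions, not from the slides.

```python
# Sketch: the exponentiated quadratic kernel gives the dot product of
# infinitely many features in closed form, without ever forming phi(x).
import numpy as np

def eq_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

x = np.array([0.3, -1.2])
y = np.array([0.5, 0.0])
print(eq_kernel(x, y))   # k(x, y) = <phi(x), phi(y)>_H
```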

7 Functions of infinitely many features
Functions are linear combinations of features:
$f(x) = \langle f, \varphi(x)\rangle_{\mathcal{H}} = \sum_{\ell=1}^{\infty} f_\ell\, \varphi_\ell(x)$

8–10 How to represent functions?
Function with exponentiated quadratic kernel:
$f(x) := \sum_{i=1}^{m} \alpha_i\, k(x_i, x) = \sum_{i=1}^{m} \alpha_i\, \langle \varphi(x_i), \varphi(x)\rangle_{\mathcal{H}} = \Big\langle \sum_{i=1}^{m} \alpha_i\, \varphi(x_i),\ \varphi(x)\Big\rangle_{\mathcal{H}} = \sum_{\ell=1}^{\infty} f_\ell\, \varphi_\ell(x),$
where $f_\ell = \sum_{i=1}^{m} \alpha_i\, \varphi_\ell(x_i)$.
[Figure: f(x) plotted as a sum of kernel bumps]
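A small sketch of such a function, $f(x) = \sum_i \alpha_i k(x_i, x)$, with arbitrary centres and coefficients; all names here are illustrative.

```python
# Sketch: a function in the RKHS as a finite linear combination of kernel
# evaluations at centres x_1..x_m (centres and alphas chosen arbitrarily).
import numpy as np

def eq_kernel(x, y, sigma=1.0):
    return np.exp(-(x - y) ** 2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
centres = rng.normal(size=5)     # x_1, ..., x_m
alphas = rng.normal(size=5)      # alpha_1, ..., alpha_m

def f(x):
    return sum(a * eq_kernel(c, x) for a, c in zip(alphas, centres))

print(f(0.0))
```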

11–12 The kernel exponential family
Kernel exponential families [Canu and Smola (2006), Fukumizu (2009)] and their GP counterparts [Adams, Murray, MacKay (2009), Rasmussen (2003)]:
$\mathcal{P} = \Big\{ p_f(x) = e^{\langle f, \varphi(x)\rangle_{\mathcal{H}} - A(f)}\, q_0(x),\ x \in \Omega,\ f \in \mathcal{F} \Big\}$
where
$\mathcal{F} = \Big\{ f \in \mathcal{H} : A(f) = \log \int_\Omega e^{f(x)}\, q_0(x)\, dx < \infty \Big\}$
Finite dimensional RKHS: one-to-one correspondence between the finite dimensional exponential family and the RKHS.
Example: the Gaussian density corresponds to $k(x, y) = xy + x^2 y^2$, with $T(x) = [x,\ x^2] = \varphi(x)$.

13 Fitting an infinite dimensional exponential family
Given random samples $X_1, \dots, X_n$ drawn i.i.d. from an unknown density $p_0 := p_{f_0} \in \mathcal{P}$, estimate $p_0$.

14 How not to do it: maximum likelihood
Maximum likelihood:
$f_{ML} = \arg\max_{f \in \mathcal{F}} \sum_{i=1}^{n} \log p_f(X_i) = \arg\max_{f \in \mathcal{F}} \Big[ \sum_{i=1}^{n} f(X_i) - n \log \int e^{f(x)}\, q_0(x)\, dx \Big]$
Solving the above yields that $f_{ML}$ satisfies
$\frac{1}{n}\sum_{i=1}^{n} \varphi(X_i) = \int \varphi(x)\, p_{f_{ML}}(x)\, dx,$
where $p_{f_{ML}}$ is the density of $P_{ML}$.
Ill posed for infinite dimensional $\varphi(x)$!

15 Score matching
Loss is the Fisher score:
$D_F(p_0, p_f) := \frac{1}{2}\int p_0(x)\, \big\|\nabla_x \log p_0(x) - \nabla_x \log p_f(x)\big\|^2\, dx$

16 Score matching (general version)
Assuming $p_f$ to be differentiable (w.r.t. $x$) and $\int p_0(x)\, \|\nabla_x \log p_f(x)\|^2\, dx < \infty$ for all $f \in \mathcal{F}$,
$D_F(p_0, p_f) := \frac{1}{2}\int p_0(x)\, \big\|\nabla_x \log p_0(x) - \nabla_x \log p_f(x)\big\|^2\, dx$
$\overset{(a)}{=} \sum_{i=1}^{d} \int p_0(x)\left[ \frac{1}{2}\left(\frac{\partial \log p_f(x)}{\partial x_i}\right)^{2} + \frac{\partial^2 \log p_f(x)}{\partial x_i^2} \right] dx + C,$
where partial integration is used in (a) under the condition that
$p_0(x)\, \frac{\partial \log p_f(x)}{\partial x_i} \to 0$ as $|x_i| \to \infty$, for all $i = 1, \dots, d$.

17–18 Empirical score matching
$p_n$ represents $n$ i.i.d. samples from $P_0$:
$D_F(p_n, p_f) := \frac{1}{n}\sum_{a=1}^{n}\sum_{i=1}^{d}\left[ \frac{1}{2}\left(\frac{\partial \log p_f(X_a)}{\partial x_i}\right)^{2} + \frac{\partial^2 \log p_f(X_a)}{\partial x_i^2} \right] + C$
Since $D_F(p_n, p_f)$ is independent of $A(f)$,
$f_n = \arg\min_{f \in \mathcal{F}} D_F(p_n, p_f)$
should be easily computable, unlike the MLE.
Add an extra regularization term $\lambda \|f\|_{\mathcal{H}}^2$.
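To make the objective concrete, the sketch below evaluates the empirical score-matching loss for a plain 1-D Gaussian model, where both derivatives are available in closed form; this only illustrates the formula, not the kernel estimator itself.

```python
# Sketch: empirical score-matching objective for a 1-D Gaussian model N(mu, sigma2).
# The kernel exponential family replaces these closed-form derivatives with RKHS ones.
import numpy as np

def empirical_score_matching(X, mu, sigma2):
    dlog = -(X - mu) / sigma2        # d/dx log p_f(x)
    d2log = -1.0 / sigma2            # d^2/dx^2 log p_f(x)
    return np.mean(0.5 * dlog ** 2 + d2log)

X = np.random.default_rng(1).normal(loc=2.0, scale=1.5, size=1000)
print(empirical_score_matching(X, mu=2.0, sigma2=1.5 ** 2))   # near the minimum
print(empirical_score_matching(X, mu=0.0, sigma2=1.0))        # much larger
```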

19–21 A kernel solution
Infinite exponential family:
$p_f(x) = e^{\langle f, \varphi(x)\rangle_{\mathcal{H}} - A(f)}\, q_0(x), \qquad \log p_f(x) = \langle f, \varphi(x)\rangle_{\mathcal{H}} - A(f) + \log q_0(x).$
Kernel trick for derivatives:
$\partial_i f(X) = \langle f, \partial_i \varphi(X)\rangle_{\mathcal{H}}$
Dot product between feature derivatives:
$\langle \partial_i \varphi(X), \partial_j \varphi(X')\rangle_{\mathcal{H}} = \partial_i\, \partial_{d+j}\, k(X, X')$
By the representer theorem, the solution is a fixed data-dependent term plus a finite combination of feature derivatives at the data:
$f_n = \hat{\xi} + \sum_{a=1}^{n}\sum_{i=1}^{d} \beta_{(a,i)}\, \partial_i k(X_a, \cdot)$

22 An RKHS solution
The RKHS solution:
$f_n = \hat{\xi} + \sum_{a=1}^{n}\sum_{i=1}^{d} \beta_{(a,i)}\, \partial_i k(X_a, \cdot)$
Need to solve a linear system of the form
$\beta_n = -\big(\underbrace{G_{XX}}_{nd \times nd} + n\lambda I\big)^{-1} h_X$
Very costly in high dimensions!
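A rough sketch of why this is costly: the code below uses random stand-ins for $G_{XX}$ and $h_X$ (not the true kernel-derivative quantities), purely to show the size of the $(nd) \times (nd)$ solve.

```python
# Sketch of the computational bottleneck: an (nd x nd) regularized linear system.
# G_XX and h_X are random placeholders, not the actual kernel-derivative matrices.
import numpy as np

n, d, lam = 200, 5, 1e-3
rng = np.random.default_rng(0)
A = rng.normal(size=(n * d, n * d))
G_XX = A @ A.T / (n * d)             # placeholder positive semi-definite matrix
h_X = rng.normal(size=n * d)

beta = np.linalg.solve(G_XX + n * lam * np.eye(n * d), -h_X)   # costs O((nd)^3)
print(beta.shape)                                              # (1000,)
```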

23 The Nyström approximation

24 Nyström approach for efficient solution
Find the best estimator $f_{n,m}$ in
$\mathcal{H}_Y := \mathrm{span}\,\{\partial_i k(y_a, \cdot)\}_{a \in [m],\ i \in [d]},$
where the $y_a \in \{X_i\}_{i=1}^{n}$ are chosen at random.
Nyström solution:
$\beta_{n,m} = -\Big(\tfrac{1}{n}\,\underbrace{B_{XY}^{\top} B_{XY}}_{md \times md} + \lambda\, \underbrace{G_{YY}}_{md \times md}\Big)^{\dagger} h_Y,$ with $B_{XY}$ of size $nd \times md$.
Solve in time $O(n m^2 d^3)$, evaluate in time $O(md)$.
Still cubic in $d$, but similar results if we take a random dimension per datapoint.
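A rough sketch of the reduced system, again with random stand-ins for $B_{XY}$, $G_{YY}$ and $h_Y$; only the matrix sizes reflect the Nyström construction.

```python
# Sketch of the Nystrom idea: m << n randomly chosen basis points, so only an
# (md x md) system is solved. Matrices below are random placeholders.
import numpy as np

n, m, d, lam = 200, 20, 5, 1e-3
rng = np.random.default_rng(0)
B_XY = rng.normal(size=(n * d, m * d))       # cross matrix, nd x md
C = rng.normal(size=(m * d, m * d))
G_YY = C @ C.T / (m * d)                     # subsampled matrix, md x md
h_Y = rng.normal(size=m * d)

M = B_XY.T @ B_XY / n + lam * G_YY           # only md x md
beta_nm = np.linalg.lstsq(M, -h_Y, rcond=None)[0]
print(beta_nm.shape)                         # (100,)
```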

25–26 Consistency: original solution
Define $C$ as the covariance between feature derivatives. Then, from [Sriperumbudur et al., JMLR (2017)]:
Rates of convergence: suppose $f_0 \in \mathcal{R}(C^{\beta})$ for some $\beta > 0$, and choose
$\lambda = n^{-\max\left\{\frac{1}{2(\beta+1)},\ \frac{1}{3}\right\}}$ as $n \to \infty$.
Then
$D_F(p_0, p_{f_n}) = O_{p_0}\!\left(n^{-\min\left\{\frac{2}{3},\ \frac{2\beta+1}{2(\beta+1)}\right\}}\right).$
Convergence also holds in other metrics: KL, Hellinger, $L^r$ for $1 < r < \infty$.

27–28 Consistency: Nyström solution
Define $C$ as the covariance between feature derivatives. Suppose $f_0 \in \mathcal{R}(C^{\beta})$ for some $\beta > 0$.
Number of subsampled points: $m = \Omega(n^{\theta} \log n)$, with $\theta$ depending on $\beta$ through $\min(2\beta, 1)$, and $\lambda$ chosen as for the original solution. Then, as $n \to \infty$,
$D_F(p_0, p_{f_{n,m}}) = O_{p_0}\!\left(n^{-\min\left\{\frac{2}{3},\ \frac{2\beta+1}{2(\beta+1)}\right\}}\right),$
the same rate as the full solution.
Convergence also holds in other metrics: KL, Hellinger, $L^r$ for $1 < r < \infty$. Same rate, but saturates sooner: in full KL, the original solution saturates at $O_{p_0}(n^{-1/2})$, and the Nyström solution saturates sooner.

29 Experimental results: ring
[Figure: sample from the ring density and the estimated score]

30 Experimental results: comparison with autoencoder
Comparison with regularized auto-encoders [Alain and Bengio (JMLR, 2014)], n = 500 training points.
[Figure: score error versus dimension d for the full solution, Nyström with m = 42 and m = 167, and denoising auto-encoders (dae) with different hidden-layer sizes]

31 Experimental results: grid of Gaussians
[Figure: sample from the grid-of-Gaussians density and the estimated score]

32 Experimental results: comparison with autoencoder
Comparison with regularized auto-encoders [Alain and Bengio (JMLR, 2014)], n = 500 training points.
[Figure: score error versus dimension d for the full solution, Nyström with m = 42 and m = 167, and denoising auto-encoders (dae) with different hidden-layer sizes]

33 The kernel conditional exponential family

34 The kernel conditional exponential family
Can we take advantage of the graphical structure of $(X_1, \dots, X_d)$?
Start from a general factorization of $P$:
$P(X_1, \dots, X_d) = \prod_i P(X_i \mid \underbrace{X_{\pi(i)}}_{\text{parents of } X_i})$
Estimate each factor independently.

35 Kernel conditional exponential family
General definition, kernel conditional exponential family [Smola and Canu, 2006]:
$p_f(y \mid x) = e^{\langle f, \psi(x, y)\rangle_{\mathcal{H}} - A(f, x)}\, q_0(y), \qquad A(f, x) = \log \int q_0(y)\, e^{\langle f, \psi(x, y)\rangle_{\mathcal{H}}}\, dy$
(joint feature map $\psi(x, y)$).

36–39 Kernel conditional exponential family
Our kernel conditional exponential family:
$p_f(y \mid x) = e^{\langle f_x, \phi(y)\rangle_{\mathcal{G}} - A(f, x)}\, q_0(y), \qquad A(f, x) = \log \int q_0(y)\, e^{\langle f_x, \phi(y)\rangle_{\mathcal{G}}}\, dy,$
linear in the sufficient statistic $\phi(y) \in \mathcal{G}$.
What does this RKHS look like? [Micchelli and Pontil (2005)]
$\langle f_x, \phi(y)\rangle_{\mathcal{G}} = \langle \Gamma_x^{*} f, \phi(y)\rangle_{\mathcal{G}} = \langle f, \Gamma_x\, \phi(y)\rangle_{\mathcal{H}},$
where $\Gamma_x : \mathcal{G} \to \mathcal{H}$ is a linear operator.
The joint feature map is $\psi(x, y) := \Gamma_x\, \phi(y)$.

40–41 What is our loss function?
The obvious approach: minimise $D_F\big[\,p_0(x)\,p_0(y \mid x)\ \big\|\ p_f(x)\,p_f(y \mid x)\,\big]$.
Problem: the expression still contains $\int p_0(y \mid x)\, dy$.
Our loss function:
$\widetilde{D}_F(p_0, p_f) := \int D_F\big(p_0(y \mid x)\,\|\,p_f(y \mid x)\big)\, \mu(x)\, dx$
for some $\mu(x)$ whose support includes the support of $p(x)$.

42–43 Finite sample estimate of the conditional density
Use the simplest operator-valued RKHS: $\Gamma_x = I_{\mathcal{G}}\, k(x, \cdot)$, i.e.
$\Gamma_x : \mathcal{G} \to \mathcal{H}, \qquad \phi(y) \mapsto \phi(y)\, k(x, \cdot).$
Solution:
$f_n(y \mid x) = \sum_{b=1}^{n}\sum_{i=1}^{d} \beta_{(b,i)}\, k(X_b, x)\, \partial_i K(Y_b, y) + \hat{\xi},$
where $\beta_n$ solves a regularized linear system,
$\beta_n \propto (G + n\lambda I)^{-1} h, \qquad (G)_{(a,i),(b,j)} = k(X_a, X_b)\, \partial_i\, \partial_{d+j} K(Y_a, Y_b),$
and $\langle \phi(y), \phi(y')\rangle_{\mathcal{G}} = K(y, y')$.
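A small sketch of evaluating the first term of $f_n(y \mid x)$ with exponentiated quadratic kernels; the coefficients $\beta$ are random here rather than fitted, so this only illustrates the form of the estimator.

```python
# Sketch: evaluating sum_{b,i} beta[b,i] * k(X_b, x) * d_i K(Y_b, y)
# with exponentiated quadratic kernels; beta is random, not fitted.
import numpy as np

rng = np.random.default_rng(0)
n, dx, dy = 10, 3, 2
X = rng.normal(size=(n, dx))
Y = rng.normal(size=(n, dy))
beta = rng.normal(size=(n, dy))

def k(a, b, s=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * s ** 2))

def dK(a, b, i, s=1.0):
    # derivative of K(a, b) with respect to a[i]
    return -(a[i] - b[i]) / s ** 2 * k(a, b, s)

def f(y, x):
    return sum(beta[b, i] * k(X[b], x) * dK(Y[b], y, i)
               for b in range(n) for i in range(dy))

print(f(np.zeros(dy), np.zeros(dx)))
```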

44–47 Expected conditional score: a failure case
[Figure: two well-separated conditionals $P(Y \mid X = -1)$ and $P(Y \mid X = 1)$, and the marginal $P(Y) = \tfrac{1}{2}\big(P(Y \mid X = -1) + P(Y \mid X = 1)\big)$]
$\widetilde{D}_F\big(\underbrace{p(y \mid x)}_{\text{target}},\ \underbrace{p(y)}_{\text{model}}\big) = 0$

48 Expected conditional score: a failure case
Why does it fail? Recall
$\widetilde{D}_F\big(p_0(y \mid x), p_f(y \mid x)\big) := \int \mu(x)\, D_F\big(p_0(y \mid x), p_f(y \mid x)\big)\, dx.$
Note that
$D_F\big(\underbrace{p(y \mid x = -1)}_{\text{target}},\ \underbrace{p(y)}_{\text{model}}\big) = \int p(y \mid -1)\, \big\|\nabla_y \log p(y \mid -1) - \nabla_y \log p(y)\big\|^2\, dy.$
The model $p(y)$ puts mass where the target conditional $p(y \mid -1)$ has no support.
Care is needed when this failure mode is approached!

49–51 Unconditional vs conditional model in practice
Red Wine: physicochemical measurements on wine samples.
Parkinsons: biomedical voice measurements from patients with early-stage Parkinson's disease.
[Table: dimension and number of samples for each dataset]
Comparison with:
- LSCDE model, with consistency guarantees [Sugiyama et al. (2010)]
- RNADE model: mixture models with deep features of parents, no guarantees [Uria et al. (2016)]
Negative log likelihoods (smaller is better, average over 5 test/train splits):

          Parkinsons       Red wine
  KCEF    2.86 ± 0.77      11.8 ± 0.93
  LSCDE   15.89 ± 1.48     14.43 ± 1.5
  NADE    3.63 ± 0.0       9.98 ± 0.0

52 Results: unconditional model
[Figure: Red Wine and Parkinsons data versus samples from the fitted KEF model, plotted for selected pairs of dimensions]

53 Results: conditional model
[Figure: Red Wine and Parkinsons data versus samples from the fitted KCEF model, plotted for selected pairs of dimensions]

54 Co-authors
From Gatsby: Michael Arbel, Heiko Strathmann, Dougal Sutherland.
External collaborators: Kenji Fukumizu, Bharath Sriperumbudur.
Questions?

55–59 Score matching: 1-D proof
$D_F(p_0, p_f) = \frac{1}{2}\int_a^b p_0(x)\left(\frac{d \log p_0(x)}{dx} - \frac{d \log p_f(x)}{dx}\right)^{2} dx$
$= \frac{1}{2}\int_a^b p_0(x)\left(\frac{d \log p_0(x)}{dx}\right)^{2} dx + \frac{1}{2}\int_a^b p_0(x)\left(\frac{d \log p_f(x)}{dx}\right)^{2} dx - \int_a^b p_0(x)\, \frac{d \log p_f(x)}{dx}\, \frac{d \log p_0(x)}{dx}\, dx$
Final term:
$\int_a^b p_0(x)\, \frac{d \log p_f(x)}{dx}\, \frac{d \log p_0(x)}{dx}\, dx = \int_a^b p_0(x)\, \frac{d \log p_f(x)}{dx}\, \frac{1}{p_0(x)}\, \frac{d p_0(x)}{dx}\, dx$
$= \left[\frac{d \log p_f(x)}{dx}\, p_0(x)\right]_a^b - \int_a^b p_0(x)\, \frac{d^2 \log p_f(x)}{dx^2}\, dx.$
