Advances in kernel exponential families
1 Advances in kernel exponential families
Arthur Gretton, Gatsby Computational Neuroscience Unit, University College London. NIPS.
2 Outline
Motivating application: fast estimation of complex multivariate densities.
The infinite exponential family:
- Multivariate Gaussian → Gaussian process
- Finite mixture model → Dirichlet process mixture model
- Finite exponential family → ???
In this talk: guaranteed speed improvements by Nyström; conditional models.
3 Goal: learn high dimensional, complex densities
(figures: scatter plots of the Red Wine and Parkinsons datasets)
We want: efficient computation and representation; statistical guarantees.
4 The exponential family
The exponential family in $\mathbb{R}^d$:
$$p(x) = \exp\big( \langle \theta, T(x) \rangle - A(\theta) \big)\, q_0(x),$$
with natural parameter $\theta$, sufficient statistic $T(x)$, log normaliser $A(\theta)$, and base measure $q_0(x)$.
Examples:
- Gaussian density: $T(x) = [x,\; x^2]$
- Gamma density: $T(x) = [\ln x,\; x]$
Can we extend this to infinite dimensions?
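As a quick numerical sanity check (not part of the slides): the Gaussian density can be written in the natural form above with $T(x) = [x, x^2]$, $\theta = [\mu/\sigma^2,\, -1/(2\sigma^2)]$, $A(\theta) = \mu^2/(2\sigma^2) + \log(\sigma\sqrt{2\pi})$, and $q_0 \equiv 1$. The parameter values below are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Standard N(mu, sigma^2) density."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def expfam_pdf(x, theta, A):
    """Exponential-family form exp(<theta, T(x)> - A(theta)) with T(x) = [x, x^2]
    and base measure q0(x) = 1."""
    T = np.stack([x, x ** 2])          # sufficient statistic, shape (2, n)
    return np.exp(theta @ T - A)

mu, sigma = 1.5, 0.7
theta = np.array([mu / sigma ** 2, -1.0 / (2 * sigma ** 2)])          # natural parameters
A = mu ** 2 / (2 * sigma ** 2) + np.log(sigma * np.sqrt(2 * np.pi))   # log normaliser

x = np.linspace(-3.0, 5.0, 9)
print(np.allclose(expfam_pdf(x, theta, A), gaussian_pdf(x, mu, sigma)))  # True
```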
5-6 Infinitely many features using kernels
Kernels: dot products of features.
Feature map $\varphi(x) \in \mathcal{H}$, $\varphi(x) = [\ldots\, \varphi_i(x)\, \ldots] \in \ell_2$.
Exponentiated quadratic kernel: $k(x, x') = \exp\!\big(-\|x - x'\|^2 / (2\sigma^2)\big)$.
For positive definite $k$,
$$k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}} = \langle k(x, \cdot), k(x', \cdot) \rangle_{\mathcal{H}}.$$
Infinitely many features $\varphi(x)$, dot product in closed form!
Features: Gaussian Processes for Machine Learning, Rasmussen and Williams, Ch. 4.
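A minimal code sketch of the exponentiated quadratic kernel (the bandwidth $\sigma$ and the data are arbitrary choices): the Gram matrix it produces is symmetric positive semi-definite, as a valid kernel requires.

```python
import numpy as np

def eq_kernel(X, Y, sigma=1.0):
    """Exponentiated quadratic kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)),
    evaluated between all rows of X (n x d) and Y (m x d)."""
    sq_dists = (np.sum(X ** 2, axis=1)[:, None]
                + np.sum(Y ** 2, axis=1)[None, :]
                - 2.0 * X @ Y.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K = eq_kernel(X, X)
print(np.allclose(K, K.T))                      # symmetric
print(np.min(np.linalg.eigvalsh(K)) > -1e-10)   # positive semi-definite
print(np.allclose(np.diag(K), 1.0))             # k(x, x) = 1
```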
7 Functions of infinitely many features
Functions are linear combinations of features: $f(x) = \sum_\ell f_\ell\, \varphi_\ell(x)$.
8-10 How to represent functions?
Function with exponentiated quadratic kernel:
$$f(x) := \sum_{i=1}^m \alpha_i\, k(x_i, x) = \sum_{i=1}^m \alpha_i \langle \varphi(x_i), \varphi(x) \rangle_{\mathcal{H}} = \Big\langle \sum_{i=1}^m \alpha_i\, \varphi(x_i),\; \varphi(x) \Big\rangle_{\mathcal{H}} = \sum_{\ell=1}^{\infty} f_\ell\, \varphi_\ell(x),$$
where $f_\ell = \sum_{i=1}^m \alpha_i\, \varphi_\ell(x_i)$.
(figure: $f(x)$ plotted against $x$)
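The expansion $f(x) = \sum_i \alpha_i k(x_i, x)$ can be evaluated directly, even though the feature coefficients $f_\ell$ live in an infinite-dimensional space. A small sketch, with arbitrary illustrative centres and weights:

```python
import numpy as np

def eq_kernel(x, y, sigma=1.0):
    """1-D exponentiated quadratic kernel."""
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

centres = np.array([-2.0, 0.0, 1.5])   # the x_i
alpha = np.array([0.5, -1.0, 2.0])     # the alpha_i

def f(x):
    """f(x) = sum_i alpha_i k(x_i, x): a smooth RKHS function."""
    return np.sum(alpha * eq_kernel(centres, x))

print(f(0.0))
```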
11-12 The kernel exponential family
Kernel exponential families [Canu and Smola (2006), Fukumizu (2009)] and their GP counterparts [Adams, Murray, MacKay (2009), Rasmussen (2003)]:
$$\mathcal{P} = \big\{ p_f(x) = e^{\langle f, \varphi(x) \rangle_{\mathcal{H}} - A(f)}\, q_0(x) \;:\; x \in \Omega,\; f \in \mathcal{F} \big\},$$
where
$$\mathcal{F} = \Big\{ f \in \mathcal{H} \;:\; A(f) = \log \int_\Omega e^{f(x)}\, q_0(x)\, dx < \infty \Big\}.$$
Finite dimensional RKHS: one-to-one correspondence between the finite dimensional exponential family and the RKHS. Example: the Gaussian density arises from the kernel $k(x, y) = xy + x^2 y^2$, with $T(x) = [x,\; x^2] = \varphi(x)$.
13 Fitting an infinite dimensional exponential family
Given random samples $X_1, \ldots, X_n$ drawn i.i.d. from an unknown density $p_0 := p_{f_0} \in \mathcal{P}$, estimate $p_0$.
14 How not to do it: maximum likelihood
Maximum likelihood:
$$f_{ML} = \arg\max_{f \in \mathcal{F}} \sum_{i=1}^n \log p_f(X_i) = \arg\max_{f \in \mathcal{F}} \Big( \sum_{i=1}^n f(X_i) - n \log \int_\Omega e^{f(x)}\, q_0(x)\, dx \Big).$$
Solving the above yields that $f_{ML}$ satisfies
$$\frac{1}{n} \sum_{i=1}^n \varphi(X_i) = \int_\Omega \varphi(x)\, p_{f_{ML}}(x)\, dx,$$
where $p_{f_{ML}} = \mathrm{d}P_{ML}/\mathrm{d}x$. Ill posed for infinite dimensional $\varphi(x)$!
15 Score matching
Loss is the Fisher score:
$$D_F(p_0, p_f) := \frac{1}{2} \int_\Omega p_0(x)\, \|\nabla_x \log p_0(x) - \nabla_x \log p_f(x)\|^2\, dx.$$
16 Score matching (general version)
Assuming $p_f$ is differentiable (w.r.t. $x$) and $\int_\Omega p_0(x)\, \|\nabla_x \log p_f(x)\|^2\, dx < \infty$,
$$D_F(p_0, p_f) := \frac{1}{2} \int_\Omega p_0(x)\, \|\nabla_x \log p_0(x) - \nabla_x \log p_f(x)\|^2\, dx \overset{(a)}{=} \int_\Omega p_0(x) \sum_{i=1}^d \left[ \frac{1}{2} \left( \frac{\partial \log p_f(x)}{\partial x_i} \right)^{\!2} + \frac{\partial^2 \log p_f(x)}{\partial x_i^2} \right] dx + C,$$
where partial integration is used in $(a)$ under the condition that
$$p_0(x)\, \frac{\partial \log p_f(x)}{\partial x_i} \to 0 \;\text{ as }\; |x_i| \to \infty, \qquad i = 1, \ldots, d.$$
17-18 Empirical score matching
$p_n$ represents $n$ i.i.d. samples from $P_0$:
$$D_F(p_n, p_f) := \frac{1}{n} \sum_{a=1}^n \sum_{i=1}^d \left[ \frac{1}{2} \left( \frac{\partial \log p_f(X_a)}{\partial x_i} \right)^{\!2} + \frac{\partial^2 \log p_f(X_a)}{\partial x_i^2} \right] + C.$$
Since $D_F(p_n, p_f)$ is independent of $A(f)$,
$$f_n = \arg\min_{f \in \mathcal{F}} D_F(p_n, p_f)$$
should be easily computable, unlike the MLE. Add an extra term $\lambda \|f\|_{\mathcal{H}}^2$ to regularize.
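For intuition, here is a finite-dimensional sketch (the Gaussian family with $T(x) = [x, x^2]$, not the kernel estimator itself). Since $\partial_x \log p_\theta(x) = \theta^\top T'(x)$ and $\partial_x^2 \log p_\theta(x) = \theta^\top T''(x)$, the empirical objective is the quadratic $\frac{1}{2}\theta^\top A \theta + b^\top \theta$ with $A = \frac{1}{n}\sum_a T'(X_a) T'(X_a)^\top$ and $b = \frac{1}{n}\sum_a T''(X_a)$, minimised by $\theta = -A^{-1} b$, with no normaliser ever computed:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma_true = 2.0, 0.5
X = rng.normal(mu_true, sigma_true, size=20000)

# Sufficient statistic T(x) = [x, x^2]: T'(x) = [1, 2x], T''(x) = [0, 2].
Tp = np.stack([np.ones_like(X), 2 * X])                  # (2, n)
Tpp = np.stack([np.zeros_like(X), 2 * np.ones_like(X)])  # (2, n)

A = Tp @ Tp.T / X.size            # (1/n) sum_a T'(X_a) T'(X_a)^T
b = Tpp.mean(axis=1)              # (1/n) sum_a T''(X_a)
theta = -np.linalg.solve(A, b)    # minimiser of the empirical score-matching loss

# Recover the usual parameters: theta = [mu/sigma^2, -1/(2 sigma^2)]
sigma_hat = np.sqrt(-1.0 / (2 * theta[1]))
mu_hat = theta[0] * sigma_hat ** 2
print(mu_hat, sigma_hat)  # close to (2.0, 0.5)
```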
19-21 A kernel solution
For the infinite exponential family, $p_f(x) = e^{\langle f, \varphi(x) \rangle_{\mathcal{H}} - A(f)}\, q_0(x)$, so
$$\log p_f(x) = \langle f, \varphi(x) \rangle_{\mathcal{H}} - A(f) + \log q_0(x).$$
Kernel trick for derivatives:
$$\partial_i f(X) = \langle f, \partial_i \varphi(X) \rangle_{\mathcal{H}},$$
and the dot product between feature derivatives is
$$\langle \partial_i \varphi(X), \partial_j \varphi(X') \rangle_{\mathcal{H}} = \partial_i \partial_{d+j}\, k(X, X').$$
By the representer theorem, the solution takes the form
$$f_n = \hat{\xi} + \sum_{a=1}^n \sum_{i=1}^d \beta_{(a,i)}\, \partial_i k(X_a, \cdot),$$
where $\hat{\xi}$ is a fixed offset term.
22 An RKHS solution
The RKHS solution:
$$f_n = \hat{\xi} + \sum_{a=1}^n \sum_{i=1}^d \beta_{(a,i)}\, \partial_i k(X_a, \cdot).$$
Need to solve a linear system:
$$\beta_n = -\frac{1}{n\lambda}\, \big( \underbrace{G_{XX}}_{nd \times nd} + n\lambda I \big)^{-1} h_X.$$
Very costly in high dimensions!
(figure: $f(x)$ plotted against $x$)
23 The Nyström approximation
24 Nyström approach for efficient solution
Find the best estimator $f_{n,m}$ in
$$\mathcal{H}_Y := \mathrm{span}\, \{ \partial_i k(y_a, \cdot) \}_{a \in [m],\, i \in [d]},$$
where the $y_a \in \{X_i\}_{i=1}^n$ are chosen at random.
Nyström solution:
$$\beta_{n,\lambda,m} = -\frac{1}{n\lambda} \Big( \underbrace{\tfrac{1}{n}\, B_{XY}^\top B_{XY}}_{md \times md} + \lambda\, \underbrace{G_{YY}}_{md \times md} \Big)^{\dagger} h_Y, \qquad B_{XY} \in \mathbb{R}^{nd \times md}.$$
Solve in time $O(nm^2 d^3)$, evaluate in time $O(md)$. Still cubic in $d$, but similar results if we take a random dimension per datapoint.
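The subsampling idea can be sketched on a generic kernel ridge system (an illustration of the Nyström principle, not the exact score-matching system above): restrict the solution to expansions over $m \ll n$ randomly chosen points, so an $m \times m$ system replaces the $n \times n$ one. All data and hyperparameters below are arbitrary choices.

```python
import numpy as np

def eq_kernel(X, Y, sigma=1.0):
    sq = np.sum(X ** 2, 1)[:, None] + np.sum(Y ** 2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
n, m, lam = 2000, 100, 1e-3
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=n)

# Full kernel ridge solve: an n x n system, O(n^3).
K = eq_kernel(X, X, sigma=0.3)
alpha_full = np.linalg.solve(K + n * lam * np.eye(n), y)
f_full = K @ alpha_full

# Nystrom / subset-of-regressors: restrict to m inducing points, O(n m^2).
idx = rng.choice(n, m, replace=False)
Knm = eq_kernel(X, X[idx], sigma=0.3)
Kmm = eq_kernel(X[idx], X[idx], sigma=0.3)
beta = np.linalg.solve(Knm.T @ Knm + n * lam * Kmm + 1e-10 * np.eye(m),
                       Knm.T @ y)
f_nys = Knm @ beta

print(np.sqrt(np.mean((f_full - f_nys) ** 2)))  # small: the two fits agree
```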
25-26 Consistency: original solution
Define $C$ as the covariance between feature derivatives. Then from [Sriperumbudur et al., JMLR (2017)]:
Rates of convergence: suppose $f_0 \in \mathcal{R}(C^\beta)$ for some $\beta > 0$, and set
$$\lambda = n^{-\max\left( \frac{1}{3},\; \frac{1}{2(\beta+1)} \right)}.$$
Then, as $n \to \infty$,
$$D_F(p_0, p_{f_n}) = O_{p_0}\!\left( n^{-\min\left( \frac{2}{3},\; \frac{2\beta+1}{2(\beta+1)} \right)} \right).$$
Convergence also holds in other metrics: KL, Hellinger, and $L_r$ for $1 < r < \infty$.
27-28 Consistency: Nyström solution
Define $C$ as the covariance between feature derivatives. Suppose $f_0 \in \mathcal{R}(C^\beta)$ for some $\beta > 0$, take the number of subsampled points
$$m = \Omega\!\big( n^{\theta} \log n \big), \qquad \theta = \frac{1}{\min(2\beta, 1) + 2} \in \left[ \tfrac{1}{3}, \tfrac{1}{2} \right],$$
and set $\lambda = n^{-\max\left( \frac{1}{3},\, \frac{1}{2(\beta+1)} \right)}$. Then, as $n \to \infty$,
$$D_F(p_0, p_{f_{n,m}}) = O_{p_0}\!\left( n^{-\min\left( \frac{2}{3},\; \frac{2\beta+1}{2(\beta+1)} \right)} \right).$$
Convergence also holds in KL, Hellinger, and $L_r$ for $1 < r < \infty$, at the same rate, but it saturates sooner: in full KL, the original estimator saturates at $O_{p_0}(n^{-1/2})$, and the Nyström estimator sooner.
29 Experimental results: ring
(figures: sample from the ring density, and the fitted score)
30 Experimental results: comparison with autoencoder score
Comparison with regularized auto-encoders [Alain and Bengio (JMLR, 2014)], n = 500 training points.
(figure: score error against dimension d, for the full estimator, Nyström with m = 42 and m = 167, and DAE with m = 100 and m = …)
31 Experimental results: grid of Gaussians
(figures: sample from the grid-of-Gaussians density, and the fitted score)
32 Experimental results: comparison with autoencoder score
Comparison with regularized auto-encoders [Alain and Bengio (JMLR, 2014)], n = 500 training points.
(figure: score error against dimension d, for the full estimator, Nyström with m = 42 and m = 167, and DAE with m = 100 and m = …)
33 The kernel conditional exponential family
34 The kernel conditional exponential family
Can we take advantage of the graphical structure of $(X_1, \ldots, X_d)$? Start from a general factorization of $P$:
$$P(X_1, \ldots, X_d) = \prod_i P\big( X_i \mid \underbrace{X_{\pi(i)}}_{\text{parents of } X_i} \big),$$
and estimate each factor independently.
35 Kernel conditional exponential family
General definition, kernel conditional exponential family [Smola and Canu, 2006]:
$$p_f(y \mid x) = e^{\langle f, \psi(x, y) \rangle_{\mathcal{H}} - A(f, x)}\, q_0(y), \qquad A(f, x) = \log \int q_0(y)\, e^{\langle f, \psi(x, y) \rangle_{\mathcal{H}}}\, dy,$$
with joint feature map $\psi(x, y)$.
36-39 Kernel conditional exponential family
Our kernel conditional exponential family:
$$p_f(y \mid x) = e^{\langle f_x, \psi(y) \rangle_{\mathcal{G}} - A(f, x)}\, q_0(y), \qquad A(f, x) = \log \int q_0(y)\, e^{\langle f_x, \psi(y) \rangle_{\mathcal{G}}}\, dy,$$
linear in the sufficient statistic $\psi(y) \in \mathcal{G}$.
What does this RKHS look like? [Micchelli and Pontil (2005)]
$$\langle f_x, \psi(y) \rangle_{\mathcal{G}} = \langle \Gamma_x^* f, \psi(y) \rangle_{\mathcal{G}} = \langle f, \Gamma_x \psi(y) \rangle_{\mathcal{H}},$$
where $\Gamma_x : \mathcal{G} \to \mathcal{H}$ is a linear operator. The feature map is $\psi(x, y) := \Gamma_x \psi(y)$.
40-41 What is our loss function?
The obvious approach: minimise $D_F\big[ p_0(x)\, p_0(y \mid x) \,\|\, p_f(x)\, p_f(y \mid x) \big]$.
Problem: the expression still contains $\int p_0(y \mid x)\, dy$.
Our loss function:
$$\widetilde{D}_F(p_0, p_f) := \int D_F\big( p_0(y \mid x) \,\|\, p_f(y \mid x) \big)\, \mu(x)\, dx,$$
for some $\mu(x)$ whose support includes the support of $p(x)$.
42-43 Finite sample estimate of the conditional density
Use the simplest operator-valued RKHS, $\Gamma_x = k(x, \cdot)\, I_{\mathcal{G}}$:
$$\Gamma_x : \mathcal{G} \to \mathcal{H}, \qquad \psi(y) \mapsto \psi(y)\, k(x, \cdot).$$
Solution:
$$f_n(y \mid x) = \hat{\xi} + \sum_{b=1}^n \sum_{i=1}^d \beta_{(b,i)}\, k(X_b, x)\, \partial_i K(Y_b, y),$$
where
$$\beta = -(G + n\lambda I)^{-1} h, \qquad G_{(a,i),(b,j)} = k(X_a, X_b)\, \partial_i \partial_{j+d} K(Y_a, Y_b),$$
and $\langle \psi(y), \psi(y') \rangle_{\mathcal{G}} = K(y, y')$.
44-47 Expected conditional score: a failure case
Take $P(Y \mid X = -1)$ and $P(Y \mid X = 1)$ with disjoint supports, and let
$$P(Y) = \tfrac{1}{2}\big( P(Y \mid X = -1) + P(Y \mid X = 1) \big).$$
Then
$$\widetilde{D}_F\big( \underbrace{p(y \mid x)}_{\text{target}},\; \underbrace{p(y)}_{\text{model}} \big) = 0.$$
48 Expected conditional score: a failure case
Why does it fail? Recall
$$\widetilde{D}_F\big( p_0(y \mid x),\, p_f(y \mid x) \big) := \int \mu(x)\, D_F\big( p_0(y \mid x),\, p_f(y \mid x) \big)\, dx.$$
Note that
$$D_F\big( \underbrace{p(y \mid x = 1)}_{\text{target}},\; \underbrace{p(y)}_{\text{model}} \big) = \int p(y \mid 1)\, \|\nabla_y \log p(y \mid 1) - \nabla_y \log p(y)\|^2\, dy.$$
The model $p(y)$ puts mass where the target conditional $p(y \mid 1)$ has no support, and the weighted divergence cannot see it. Care is needed when this failure mode is approached!
49 Unconditional vs conditional model in practice
Red Wine: physiochemical measurements on wine samples.
Parkinsons: biomedical voice measurements from patients with early stage Parkinson's disease.

            Parkinsons   Red Wine
Dimension       …            …
Samples         …            …
50-51 Unconditional vs conditional model in practice
Red Wine: physiochemical measurements on wine samples.
Parkinsons: biomedical voice measurements from patients with early stage Parkinson's disease.
Comparison with:
- LSCDE model: with consistency guarantees [Sugiyama et al. (2010)]
- RNADE model: mixture models with deep features of parents, no guarantees [Uria et al. (2016)]
Negative log likelihoods (smaller is better, average over 5 test/train splits):

          Parkinsons       Red wine
KCEF       2.86 ± 0.77     11.8  ± 0.93
LSCDE     15.89 ± 1.48     14.43 ± 1.5
NADE       3.63 ± 0.0       9.98 ± 0.0
52 Results: unconditional model
(figures: Red Wine and Parkinsons scatter plots, data vs KEF samples)
53 Results: conditional model
(figures: Red Wine and Parkinsons scatter plots, data vs KCEF samples)
54 Co-authors
From Gatsby: Michael Arbel, Heiko Strathmann, Dougal Sutherland.
External collaborators: Kenji Fukumizu, Bharath Sriperumbudur.
Questions?
55-59 Score matching: 1-D proof
$$D_F(p_0, p_f) = \frac{1}{2} \int_a^b p_0(x) \left( \frac{d \log p_0(x)}{dx} - \frac{d \log p_f(x)}{dx} \right)^{\!2} dx$$
$$= \frac{1}{2} \int_a^b p_0(x) \left( \frac{d \log p_0(x)}{dx} \right)^{\!2} dx + \frac{1}{2} \int_a^b p_0(x) \left( \frac{d \log p_f(x)}{dx} \right)^{\!2} dx - \int_a^b p_0(x)\, \frac{d \log p_f(x)}{dx}\, \frac{d \log p_0(x)}{dx}\, dx.$$
Final term:
$$\int_a^b p_0(x)\, \frac{d \log p_f(x)}{dx}\, \frac{d \log p_0(x)}{dx}\, dx = \int_a^b p_0(x)\, \frac{d \log p_f(x)}{dx}\, \frac{1}{p_0(x)}\, \frac{d p_0(x)}{dx}\, dx = \left[ p_0(x)\, \frac{d \log p_f(x)}{dx} \right]_a^b - \int_a^b p_0(x)\, \frac{d^2 \log p_f(x)}{dx^2}\, dx.$$
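The integration-by-parts step can be checked numerically: when $p_0$ vanishes at the interval ends, $\int p_0\, (\log p_f)'\, (\log p_0)'\, dx = -\int p_0\, (\log p_f)''\, dx$. A quadrature sketch with $p_0 = N(0, 1)$ and an arbitrary smooth model, $\log p_f(x) = -x^4/4$ up to normalisation:

```python
import numpy as np

def integrate(fvals, x):
    """Trapezoidal quadrature on a grid."""
    return float(np.sum(0.5 * (fvals[:-1] + fvals[1:]) * np.diff(x)))

x = np.linspace(-8.0, 8.0, 200001)              # wide enough that p0 ~ 0 at the ends
p0 = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)   # p0 = N(0, 1)
dlog_p0 = -x                                    # (log p0)'

dlog_pf = -x ** 3        # (log p_f)' for log p_f(x) = -x^4/4 + const
d2log_pf = -3 * x ** 2   # (log p_f)''

lhs = integrate(p0 * dlog_pf * dlog_p0, x)
rhs = -integrate(p0 * d2log_pf, x)
print(lhs, rhs)  # both equal E[X^4] = 3 for X ~ N(0, 1)
```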
More informationADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING. Non-linear regression techniques Part - II
1 Non-linear regression techniques Part - II Regression Algorithms in this Course Support Vector Machine Relevance Vector Machine Support vector regression Boosting random projections Relevance vector
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 89 Part II
More informationStatistical Convergence of Kernel CCA
Statistical Convergence of Kernel CCA Kenji Fukumizu Institute of Statistical Mathematics Tokyo 106-8569 Japan fukumizu@ism.ac.jp Francis R. Bach Centre de Morphologie Mathematique Ecole des Mines de Paris,
More informationThe Gaussian distribution
The Gaussian distribution Probability density function: A continuous probability density function, px), satisfies the following properties:. The probability that x is between two points a and b b P a
More informationComputer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression
Group Prof. Daniel Cremers 4. Gaussian Processes - Regression Definition (Rep.) Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
More informationBayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework
HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for
More informationNaïve Bayes classification
Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationLinear Methods for Prediction
Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we
More informationLecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.
Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods
More informationApproximate Kernel PCA with Random Features
Approximate Kernel PCA with Random Features (Computational vs. Statistical Tradeoff) Bharath K. Sriperumbudur Department of Statistics, Pennsylvania State University Journées de Statistique Paris May 28,
More informationGaussian Mixture Models
Gaussian Mixture Models Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 Some slides courtesy of Eric Xing, Carlos Guestrin (One) bad case for K- means Clusters may overlap Some
More informationCIS 520: Machine Learning Oct 09, Kernel Methods
CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed
More informationParameter estimation Conditional risk
Parameter estimation Conditional risk Formalizing the problem Specify random variables we care about e.g., Commute Time e.g., Heights of buildings in a city We might then pick a particular distribution
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationRobust Support Vector Machines for Probability Distributions
Robust Support Vector Machines for Probability Distributions Andreas Christmann joint work with Ingo Steinwart (Los Alamos National Lab) ICORS 2008, Antalya, Turkey, September 8-12, 2008 Andreas Christmann,
More informationGaussian Process Regression Networks
Gaussian Process Regression Networks Andrew Gordon Wilson agw38@camacuk mlgengcamacuk/andrew University of Cambridge Joint work with David A Knowles and Zoubin Ghahramani June 27, 2012 ICML, Edinburgh
More informationMachine Learning (CS 567) Lecture 5
Machine Learning (CS 567) Lecture 5 Time: T-Th 5:00pm - 6:20pm Location: GFS 118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol
More informationUnsupervised Nonparametric Anomaly Detection: A Kernel Method
Fifty-second Annual Allerton Conference Allerton House, UIUC, Illinois, USA October - 3, 24 Unsupervised Nonparametric Anomaly Detection: A Kernel Method Shaofeng Zou Yingbin Liang H. Vincent Poor 2 Xinghua
More informationIntroduction to Gaussian Processes
Introduction to Gaussian Processes Iain Murray School of Informatics, University of Edinburgh The problem Learn scalar function of vector values f(x).5.5 f(x) y i.5.2.4.6.8 x f 5 5.5 x x 2.5 We have (possibly
More informationThe Poisson transform for unnormalised statistical models. Nicolas Chopin (ENSAE) joint work with Simon Barthelmé (CNRS, Gipsa-LAB)
The Poisson transform for unnormalised statistical models Nicolas Chopin (ENSAE) joint work with Simon Barthelmé (CNRS, Gipsa-LAB) Part I Unnormalised statistical models Unnormalised statistical models
More informationReminders. Thought questions should be submitted on eclass. Please list the section related to the thought question
Linear regression Reminders Thought questions should be submitted on eclass Please list the section related to the thought question If it is a more general, open-ended question not exactly related to a
More informationKernel Methods. Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton
Kernel Methods Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Alexander J. Smola Statistical Machine Learning Program Canberra,
More informationClassification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1).
Regression and PCA Classification The goal: map from input X to a label Y. Y has a discrete set of possible values We focused on binary Y (values 0 or 1). But we also discussed larger number of classes
More informationdt 2 The Order of a differential equation is the order of the highest derivative that occurs in the equation. Example The differential equation
Lecture 18 : Direction Fields and Euler s Method A Differential Equation is an equation relating an unknown function and one or more of its derivatives. Examples Population growth : dp dp = kp, or = kp
More informationLess is More: Computational Regularization by Subsampling
Less is More: Computational Regularization by Subsampling Lorenzo Rosasco University of Genova - Istituto Italiano di Tecnologia Massachusetts Institute of Technology lcsl.mit.edu joint work with Alessandro
More informationNonparmeteric Bayes & Gaussian Processes. Baback Moghaddam Machine Learning Group
Nonparmeteric Bayes & Gaussian Processes Baback Moghaddam baback@jpl.nasa.gov Machine Learning Group Outline Bayesian Inference Hierarchical Models Model Selection Parametric vs. Nonparametric Gaussian
More information